@Lee-W
Member

@Lee-W Lee-W commented Dec 18, 2025

Why

On worker startup, tasks can fail if the Dag or task cannot be loaded due to transient infrastructure issues (e.g., missing Dag files, network or filesystem problems). Currently, the worker calls sys.exit(1) in this case. Since the task has not actually run, Airflow should reschedule it intelligently without counting against the task's retry limit.

What

Instead of exiting, the task runner now raises AirflowRescheduleException to mark the task as UP_FOR_RESCHEDULE. This allows the scheduler to reschedule the task after a configurable delay, retrying safely without increasing the task's retry count.
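The control flow described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: `RescheduleSignal` is a hypothetical stand-in for `AirflowRescheduleException`, and `TaskInstance`, `load_dag`, `startup`, and `run_task` are simplified names invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


class RescheduleSignal(Exception):
    """Hypothetical stand-in for AirflowRescheduleException."""

    def __init__(self, reschedule_date: datetime) -> None:
        super().__init__(f"reschedule at {reschedule_date.isoformat()}")
        self.reschedule_date = reschedule_date


@dataclass
class TaskInstance:
    """Simplified, hypothetical task-instance record."""

    state: str = "queued"
    try_number: int = 1
    next_attempt_at: datetime | None = None


def load_dag(dag_id: str, available: set[str]) -> str:
    # Simulates a Dag that cannot be found on the worker.
    if dag_id not in available:
        raise FileNotFoundError(f"Dag {dag_id!r} not found")
    return dag_id


def startup(dag_id: str, available: set[str], now: datetime, delay: timedelta) -> str:
    # Previously a load failure here ended the process with sys.exit(1).
    # In this sketch it is turned into a reschedule request instead.
    try:
        return load_dag(dag_id, available)
    except FileNotFoundError as err:
        raise RescheduleSignal(now + delay) from err


def run_task(
    ti: TaskInstance,
    dag_id: str,
    available: set[str],
    now: datetime,
    delay: timedelta = timedelta(seconds=60),
) -> TaskInstance:
    try:
        startup(dag_id, available, now, delay)
    except RescheduleSignal as sig:
        # The task never ran, so try_number is left untouched.
        ti.state = "up_for_reschedule"
        ti.next_attempt_at = sig.reschedule_date
        return ti
    ti.state = "success"
    return ti


if __name__ == "__main__":
    now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(run_task(TaskInstance(), "missing_dag", {"other_dag"}, now))
```

The point of the sketch is that the load failure is reported as a reschedule request with a target time, while the retry counter stays where it was.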



@boring-cyborg boring-cyborg bot added the area:API, area:DAG-processing, area:Scheduler, and area:task-sdk labels Dec 18, 2025
@Lee-W Lee-W changed the title from "refactor: minor refactors" to "fix: re-queue tasks if Dag cannot be found in the worker" Dec 18, 2025
@Lee-W Lee-W force-pushed the requeue branch 3 times, most recently from d162801 to 1e81740 Compare December 22, 2025 01:47
@Lee-W Lee-W force-pushed the requeue branch 6 times, most recently from 81f85e8 to 603ac16 Compare December 31, 2025 06:38
@Lee-W Lee-W changed the title from "fix: re-queue tasks if Dag cannot be found in the worker" to "Reschedule tasks on worker startup Dag load failures instead of exiting" Dec 31, 2025
@Lee-W Lee-W marked this pull request as ready for review December 31, 2025 09:49
@Lee-W Lee-W force-pushed the requeue branch 4 times, most recently from dfba98a to b489c01 Compare January 6, 2026 07:04
@potiuk
Member

potiuk commented Jan 6, 2026

This is a pretty cool "reliability" feature. I think we should also implement it in a number of other places, because it can provide resilience to transient issues. But I think it needs a review from someone who has a deeper understanding of deps handling, so I will refrain from approving it for some time (though it's tempting).

One thing to add though: I think we should have a better way of signalling that those issues are happening, for example metrics, or maybe even a warning in the UI displayed as a dismissible notification. While I think it's cool that we handle this on our own, it might hide systemic issues that the deployment manager should handle. So while we should let it self-recover, we should also notify about those issues pretty aggressively.

@potiuk
Member

potiuk commented Jan 6, 2026

Potentially, such a notification should link to a doc URL describing potential causes and remediations.

@Lee-W Lee-W requested a review from uranusjr January 20, 2026 02:27
@Lee-W Lee-W force-pushed the requeue branch 2 times, most recently from 42a910b to 66394a4 Compare January 21, 2026 09:50
@Lee-W Lee-W merged commit d98a20a into apache:main Jan 22, 2026
101 checks passed
@Lee-W Lee-W deleted the requeue branch January 22, 2026 06:55
Lee-W added a commit to astronomer/airflow that referenced this pull request Jan 22, 2026
@Lee-W
Member Author

Lee-W commented Jan 22, 2026

backport through #60926

suii2210 pushed a commit to suii2210/airflow that referenced this pull request Jan 26, 2026
shreyas-dev pushed a commit to shreyas-dev/airflow that referenced this pull request Jan 29, 2026
jhgoebbert pushed a commit to jhgoebbert/airflow_Owen-CH-Leung that referenced this pull request Feb 8, 2026
Labels

area:DAG-processing, area:Scheduler, area:task-sdk

6 participants