Reschedule tasks on worker startup Dag load failures instead of exiting #59604
This is a pretty cool "reliability" feature. I think we should implement something similar in a number of other places, because it can provide resilience to transient issues. But I think it needs someone with a deeper understanding of deps handling, so I will refrain from approving it for some time (though it's tempting). One thing to add, though: I think we should have a better way of signalling that these issues are happening, for example metrics, or maybe even a warning in the UI displayed as a dismissable notification. While I think it's cool that we handle this on our own, it might hide systemic issues that the deployment manager should handle, so while we should let it self-recover, we should also notify about these issues pretty aggressively.
Potentially such a notification should link to a doc URL describing potential reasons and remediations.
…ng (apache#59604) (cherry picked from commit d98a20a)
backport through #60926
Why
On worker startup, tasks can fail if the Dag or task cannot be loaded due to transient infrastructure issues (e.g., missing Dag files, network or filesystem problems). Currently, the worker calls `sys.exit(1)` in this case. Since the task has not actually run, Airflow should reschedule it intelligently without counting against the task's retry limit.
What
Instead of exiting, the task runner now raises `AirflowRescheduleException` to mark the task as `UP_FOR_RESCHEDULE`. This allows the scheduler to reschedule the task after a configurable delay, retrying safely without increasing the task's retry count.
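The behavior change described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR. `RescheduleSignal` is a hypothetical stand-in for `AirflowRescheduleException`, and `load_task_or_reschedule`, `load_dag`, and `delay_seconds` are names invented for illustration. The sketch only shows the general pattern of turning a load failure into a reschedule request rather than a process exit.

```python
from datetime import datetime, timedelta, timezone


class RescheduleSignal(Exception):
    """Hypothetical stand-in for AirflowRescheduleException (assumption)."""

    def __init__(self, reschedule_date):
        super().__init__(f"reschedule at {reschedule_date.isoformat()}")
        self.reschedule_date = reschedule_date


def load_task_or_reschedule(load_dag, delay_seconds=60, now=None):
    """Call load_dag(); on failure, request a reschedule instead of exiting.

    load_dag is any zero-argument callable (hypothetical loader).
    delay_seconds stands in for the configurable delay mentioned above.
    """
    now = now or datetime.now(timezone.utc)
    try:
        return load_dag()
    except Exception as err:
        # Old behavior (per the description): sys.exit(1) at this point.
        # New behavior: signal that the task should be tried again later,
        # keeping the original error attached as the cause.
        raise RescheduleSignal(now + timedelta(seconds=delay_seconds)) from err


def missing_dag_file():
    # Simulates a transient failure such as a Dag file that is not synced yet.
    raise FileNotFoundError("dag file not available")


try:
    load_task_or_reschedule(missing_dag_file, delay_seconds=30)
except RescheduleSignal as signal:
    # A caller would map this to an UP_FOR_RESCHEDULE-like state.
    print("reschedule requested for", signal.reschedule_date)
```

Because the failure surfaces as an exception rather than a process exit, the caller can treat it separately from a real task failure, which is what lets the retry count stay unchanged.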