Add DAG-level automatic retries (issue #60866) #63907

Closed
dv-gorasiya wants to merge 4 commits into apache:main from dv-gorasiya:fix-dag-level-retries-60866

@dv-gorasiya
Contributor

@dv-gorasiya dv-gorasiya commented Mar 19, 2026

Summary

Implements DAG-level automatic retries for issue #60866.

Changes

  • Add dag_try_number on dag_run + migration
  • max_dag_retries / dag_retry_delay on SDK DAG + serialized DAG
  • On final failure with retries left: re-queue run, clear failed TIs, optional delay on run_after
  • Unit tests in test_dagrun.py
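
The retry flow in the third bullet can be sketched roughly as below. This is a minimal, self-contained illustration and not the PR's actual code: the class names `DagRunSketch` and `TaskInstanceSketch` and the function `handle_final_failure` are hypothetical, and the sketch assumes that `dag_try_number` starts at 1 and that only failed task instances are cleared, as the bullet states. Only `dag_try_number`, `max_dag_retries`, `dag_retry_delay`, and `run_after` are names taken from the summary.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class TaskInstanceSketch:
    """Hypothetical stand-in for a task instance; not an Airflow class."""
    task_id: str
    state: str | None  # "success", "failed", or None once cleared


@dataclass
class DagRunSketch:
    """Hypothetical stand-in for a DAG run carrying the fields from the summary."""
    max_dag_retries: int = 0
    dag_retry_delay: timedelta | None = None
    dag_try_number: int = 1  # assumption: the first attempt is try number 1
    state: str = "running"
    run_after: datetime | None = None
    task_instances: list[TaskInstanceSketch] = field(default_factory=list)


def handle_final_failure(run: DagRunSketch, now: datetime) -> bool:
    """Re-queue the run if retries remain; return True when it was re-queued."""
    retries_used = run.dag_try_number - 1
    if retries_used >= run.max_dag_retries:
        run.state = "failed"  # no retries left: the failure is final
        return False

    run.dag_try_number += 1
    for ti in run.task_instances:
        if ti.state == "failed":
            ti.state = None  # clear only the failed task instances
    if run.dag_retry_delay is not None:
        run.run_after = now + run.dag_retry_delay  # optional delay before the retry
    run.state = "queued"
    return True


if __name__ == "__main__":
    now = datetime(2026, 3, 19, tzinfo=timezone.utc)
    run = DagRunSketch(
        max_dag_retries=1,
        dag_retry_delay=timedelta(minutes=5),
        task_instances=[
            TaskInstanceSketch("task_a", "success"),
            TaskInstanceSketch("task_b", "failed"),
        ],
    )
    print(handle_final_failure(run, now), run.state, run.dag_try_number)
```

In this sketch a successful upstream task keeps its state; whether the real change also re-runs upstream tasks is not something the summary confirms.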

Testing

pytest airflow-core/tests/unit/models/test_dagrun.py -k dag_level_retry

Closes #60866

Gen-AI: This PR was created with assistance from generative AI tools. I have reviewed the code, run the relevant tests, and take responsibility for the contribution.

@dv-gorasiya dv-gorasiya force-pushed the fix-dag-level-retries-60866 branch from 0b20b69 to 0d4ef7c March 19, 2026 00:26
Contributor

@yuseok89 yuseok89 left a comment

The CI error seems to be related to migrations-ref.rst. Try running:

prek run --all-files

This should apply the hook changes and resolve the CI failure.

@potiuk potiuk added the "ready for maintainer review" label Mar 20, 2026
@kaxil kaxil requested a review from Copilot April 2, 2026 00:45
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@jscheffl
Contributor

jscheffl commented Apr 3, 2026

Why do we need to add the complexity of a DAG-level retry when we already have retries at the task level? I think the same can be achieved by setting retries for all tasks via default_args.

@yuseok89
Contributor

yuseok89 commented Apr 4, 2026

@jscheffl
My reading of this PR and issue is that it goes beyond what you get from default_args. Retries set through default_args are task retries. They only re-run the task that failed.
For example, with task_a >> task_b, if task_b fails, Airflow retries only task_b. task_a already succeeded and is not run again.
My understanding of this change is that it adds a DAG run level retry so the run can be tried again in a way that can include upstream work like task_a again, not only the last failed task. So setting retries in default_args for all tasks does not fully replace that behavior.
If I misunderstood the scheduler or model behavior, the author can correct me.

@dv-gorasiya
Please correct me if I've misunderstood how the PR behaves.
For reviewers, a short screen recording or a few screenshots would make the behavior much easier to grasp.
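
The difference described above between task-level retries and a run-level retry can be illustrated with a small simulation. This is plain Python, not Airflow code, and the function names `run_with_task_retries` and `run_with_run_retries` are hypothetical. The second function models the reading given in this comment (the whole run, including `task_a`, is executed again), which the author has not confirmed in this thread.

```python
from collections import Counter


def run_with_task_retries(tasks, fails_first_n, retries):
    """Run tasks in order; a failing task is retried in place.

    fails_first_n maps a task id to how many of its first attempts fail.
    Returns (execution counts per task, whether the run succeeded).
    """
    executions = Counter()
    remaining = dict(fails_first_n)
    for task in tasks:
        for _ in range(retries + 1):
            executions[task] += 1
            if remaining.get(task, 0) > 0:
                remaining[task] -= 1  # this attempt fails; retry the same task
                continue
            break  # task succeeded; move on to the next task
        else:
            return executions, False  # all attempts for this task failed
    return executions, True


def run_with_run_retries(tasks, fails_first_n, run_retries):
    """Run tasks in order; on any failure, restart the whole run from the top."""
    executions = Counter()
    remaining = dict(fails_first_n)
    for _ in range(run_retries + 1):
        run_ok = True
        for task in tasks:
            executions[task] += 1
            if remaining.get(task, 0) > 0:
                remaining[task] -= 1
                run_ok = False
                break  # abandon this attempt of the run
        if run_ok:
            return executions, True
    return executions, False


if __name__ == "__main__":
    tasks = ["task_a", "task_b"]
    flaky = {"task_b": 1}  # task_b fails once, then succeeds
    print(run_with_task_retries(tasks, flaky, retries=1))
    print(run_with_run_retries(tasks, flaky, run_retries=1))
```

With `task_b` failing once, the task-level variant executes `task_a` once and `task_b` twice, while the run-level variant executes both tasks twice.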

Contributor

@yuseok89 yuseok89 left a comment

@dv-gorasiya
Please resolve the merge conflicts before we dig deeper into review.

@jscheffl
Contributor

jscheffl commented Apr 4, 2026

> @jscheffl My reading of this PR and issue is that it goes beyond what you get from default_args. Retries set through default_args are task retries. They only re-run the task that failed. For example, with task_a >> task_b, if task_b fails, Airflow retries only task_b. task_a already succeeded and is not run again. My understanding of this change is that it adds a DAG run level retry so the run can be tried again in a way that can include upstream work like task_a again, not only the last failed task. So setting retries in default_args for all tasks does not fully replace that behavior. If I misunderstood the scheduler or model behavior, the author can correct me.

Yes, okay. Then it is not a "small addition or bugfix" but something that should first be aligned with the development community: whether such a use case should be supported and whether the additional complexity is acceptable. Scheduling is already complex, and adding more parameters, loops, and complexity is something that needs to be agreed on.

Can you please send an email to the devlist as a [DISCUSS] thread on whether this feature should be accepted and whether the method of implementation is the right way?

@potiuk
Member

potiuk commented Apr 4, 2026

Closing. This needs discussion first.

@potiuk potiuk closed this Apr 4, 2026

Labels

area:DAG-processing, area:db-migrations (PRs with DB migration), area:task-sdk, ready for maintainer review (Set after triaging when all criteria pass.)

Development

Successfully merging this pull request may close these issues.

DAG-Level Automatic Retries Based on Terminal Task Status

5 participants