
Add repair_run action to manage_job_runs MCP tool #444

Open
jacksandom wants to merge 2 commits into main from fix/repair-job-run

Summary

  • Adds repair action to manage_job_runs MCP tool, enabling retry of only failed tasks instead of re-running entire jobs
  • Implements repair_run() core function following the existing run_job_now pattern
  • Supports rerun_all_failed_tasks, rerun_dependent_tasks, rerun_tasks, and latest_repair_id

Closes #392
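
As a rough illustration of the core function described above, here is a minimal sketch. It is not the code in this PR: it assumes `client` is shaped like a Databricks SDK `WorkspaceClient` (a `jobs.repair_run(...)` call whose result exposes `response.repair_id`), and `StubJobs` is a hypothetical stand-in so the example runs without a workspace.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


def repair_run(
    client: Any,
    run_id: int,
    rerun_all_failed_tasks: Optional[bool] = None,
    rerun_dependent_tasks: Optional[bool] = None,
    rerun_tasks: Optional[List[str]] = None,
    latest_repair_id: Optional[int] = None,
) -> Dict[str, Any]:
    """Sketch: repair a run, forwarding only the options the caller set."""
    options = {
        "rerun_all_failed_tasks": rerun_all_failed_tasks,
        "rerun_dependent_tasks": rerun_dependent_tasks,
        "rerun_tasks": rerun_tasks,
        "latest_repair_id": latest_repair_id,
    }
    kwargs = {name: value for name, value in options.items() if value is not None}
    # Assumption: the client returns an object whose .response carries repair_id.
    waiter = client.jobs.repair_run(run_id=run_id, **kwargs)
    return {"run_id": run_id, "repair_id": waiter.response.repair_id}


class StubJobs:
    """Hypothetical stand-in for the jobs API; records calls, numbers repairs."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def repair_run(self, run_id: int, **kwargs: Any) -> SimpleNamespace:
        self.calls.append({"run_id": run_id, **kwargs})
        return SimpleNamespace(response=SimpleNamespace(repair_id=len(self.calls)))


client = SimpleNamespace(jobs=StubJobs())

# First repair: retry every failed task of run 123 (run id is made up).
first = repair_run(client, run_id=123, rerun_all_failed_tasks=True)

# Chained repair: pass the previous repair_id as latest_repair_id.
second = repair_run(
    client, run_id=123, rerun_tasks=["load"], latest_repair_id=first["repair_id"]
)
```

The second call shows the chained case from the test plan: a follow-up repair on the same run passes the `repair_id` returned by the previous one.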

Problem

When a job run had failed tasks, the LLM had no way to repair the run via MCP. The repair action was missing from manage_job_runs, so attempts fell through to a ValueError. The LLM then fell back to run_now, re-running all tasks (including successful ones), wasting compute and time.
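
The fall-through described above can be sketched with a small action dispatcher. This is an illustrative assumption about the tool's shape, not its actual code: `manage_job_runs_sketch`, `HANDLERS`, and the two handler functions are hypothetical names, and the handlers only echo their inputs.

```python
from typing import Any, Callable, Dict


def _run_now(**params: Any) -> Dict[str, Any]:
    # Placeholder handler: the real action would trigger a full job run.
    return {"action": "run_now", **params}


def _repair(**params: Any) -> Dict[str, Any]:
    # Placeholder handler: the real action would re-run only selected tasks.
    return {"action": "repair", **params}


# Before this PR, the equivalent of the "repair" entry did not exist,
# so a "repair" request reached the ValueError branch below.
HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "run_now": _run_now,
    "repair": _repair,
}


def manage_job_runs_sketch(action: str, **params: Any) -> Dict[str, Any]:
    """Dispatch an action by name; unknown actions raise ValueError."""
    handler = HANDLERS.get(action)
    if handler is None:
        raise ValueError(
            f"Unknown action {action!r}; expected one of {sorted(HANDLERS)}"
        )
    return handler(**params)


result = manage_job_runs_sketch("repair", run_id=123, rerun_all_failed_tasks=True)
```

With the `"repair"` entry registered, the request is routed to a handler instead of raising, so the caller has no reason to fall back to `run_now`.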

Changes

| File | Change |
| --- | --- |
| databricks-tools-core/.../jobs/runs.py | New repair_run() function |
| databricks-tools-core/.../jobs/__init__.py | Export repair_run |
| databricks-mcp-server/.../tools/jobs.py | Import, 4 new params, "repair" action dispatch with error handling |
| databricks-tools-core/tests/.../conftest.py | failing_notebook_path fixture |
| databricks-tools-core/tests/.../test_runs.py | TestRepairRun class |

Test plan

  • All 11 integration tests pass (including new TestRepairRun)
  • MCP smoke test: repair with rerun_all_failed_tasks=True returns repair_id
  • MCP smoke test: chained repair with latest_repair_id
  • Existing tests unaffected

@jacksandom jacksandom requested a review from calreynolds April 10, 2026 08:52

Development

Successfully merging this pull request may close these issues.

bug: repair_run triggers full job run instead of repairing failed tasks