feat(stealing): add margin-of-improvement bound to prevent work-stealing thrash #9262
prince8273 wants to merge 10 commits
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
31 files ±0, 31 suites ±0, 10h 55m 6s ⏱️ -12m 4s
For more details on these failures, see this check.
Results for commit 617590f. ± Comparison against base commit deb1004.
♻️ This comment has been updated with latest results.
f052680 to 2e48bea
CI Status Note
The failing checks are pre-existing flaky tests unrelated to this PR (see the dask/distributed test report). The GitHub Actions bot confirms: 3 ❌ (-1) against base commit cf508b9.
@fjetter Could you take a look when you get a chance?
… death
When a worker drops off the cluster unexpectedly (e.g., due to an OOM kill), the scheduler tracks its processing_keys but previously did not log them to the console. This change logs exactly which tasks were interrupted, which makes cluster hangs and memory crashes much easier to debug.
- Add reject_count_margin_total metric to WorkStealing.metrics
- Add observability logging for interrupted tasks in scheduler.py
- Add test_reject_count_margin_metric to test_steal.py
- Revert accidental range() changes in test_steal.py

Signed-off-by: prince8273 <princesingh29757@gmail.com>
4026b1c to 617590f
Problem
The scheduler would steal a task whenever the thief was even 1 ms
faster than the victim. For data-heavy, compute-light tasks this
caused chronic thrashing: transfer costs routinely exceeded the savings.
Change
Added a margin constraint to `balance()`: the thief must now promise a speedup of at least 50% of the network
transfer cost. Marginal steals that are net-negative under realistic
network jitter are suppressed.
Observability
Added `reject_count_margin_total` (keyed by level) to `WorkStealing.metrics` so operators can measure exactly how many thrashing steals are being prevented. A `logger.debug` line is emitted on each rejection with full task and margin details.
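As a rough illustration of the counter-plus-log pattern described here, the sketch below keeps a per-level rejection count and emits a debug line per rejection. It is standalone and hypothetical: the `metrics` dict layout, the `record_margin_rejection` helper, the log format, and the treatment of `level` as an integer bucket are assumptions, not the PR's implementation.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("stealing_sketch")  # hypothetical logger name

# Assumed layout: one counter per metric name, keyed by stealing level.
metrics = {"reject_count_margin_total": defaultdict(int)}


def record_margin_rejection(key: str, level: int, speedup: float, required: float) -> None:
    """Hypothetical helper: count a margin rejection and log its details."""
    metrics["reject_count_margin_total"][level] += 1
    logger.debug(
        "Rejected steal of %s at level %d: speedup %.3fs below required margin %.3fs",
        key, level, speedup, required,
    )


record_margin_rejection("task-a", level=2, speedup=0.001, required=0.100)
record_margin_rejection("task-b", level=2, speedup=0.010, required=0.100)
record_margin_rejection("task-c", level=5, speedup=0.020, required=0.250)

# Total across all levels, the figure an operator would likely watch.
total_rejected = sum(metrics["reject_count_margin_total"].values())
```

Keying by level keeps the counter cheap to update while still showing whether rejections cluster in particular levels.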
Tests
Added `test_reject_count_margin_metric`, which simulates a high `comm_cost`/low compute scenario, triggers `balance()`, and asserts `reject_count_margin_total >= 1`.