Add deadlock detection with no-progress timeout #458
Conversation
Detects when all workers are stuck by tracking the time since the last progress. If no futures complete or time out for 30 minutes (configurable via EVALUATION_NO_PROGRESS_TIMEOUT), assumes deadlock and force terminates. Also force terminates zombie workers on pool shutdown when any instances timed out, since those workers are stuck in blocking I/O and will never return normally.

Co-authored-by: openhands <openhands@all-hands.dev>
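The mechanism could be sketched roughly as follows. All names here are illustrative, not the actual evaluation.py API; the real implementation differs:

```python
import concurrent.futures as cf
import os
import time

# Seconds without any completed future before assuming deadlock.
DEFAULT_NO_PROGRESS_TIMEOUT = float(
    os.environ.get("EVALUATION_NO_PROGRESS_TIMEOUT", 1800)
)

def run_with_deadlock_detection(
    pool, futures, handle_result, handle_deadlock,
    no_progress_timeout=DEFAULT_NO_PROGRESS_TIMEOUT, poll_interval=30.0,
):
    """Poll futures, tracking the time of last progress.  If nothing
    completes for no_progress_timeout seconds, treat it as a deadlock:
    hand the stuck futures to handle_deadlock and force-shut the pool,
    since zombie workers blocked on I/O will never return normally."""
    pending = set(futures)
    last_progress = time.monotonic()
    while pending:
        done, pending = cf.wait(
            pending, timeout=poll_interval, return_when=cf.FIRST_COMPLETED
        )
        if done:
            last_progress = time.monotonic()
            for fut in done:
                handle_result(fut)
        elif time.monotonic() - last_progress > no_progress_timeout:
            # Workers will never return; write error outputs and give up.
            handle_deadlock(pending)
            pending = set()
    pool.shutdown(wait=False)  # don't wait for zombie workers
```

The key idea is that `last_progress` is only refreshed when a future actually finishes, so a pool full of stuck workers eventually trips the timeout even though each individual wait call keeps returning.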
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - solves real problem, needs tests
Analysis: The logic is sound - track progress, detect deadlock, force terminate. Pragmatic solution to a real production problem. Main issue: no tests for this critical new behavior.
Verdict: ✅ Worth merging after tests added
Key Insight: Good defensive programming for worker deadlocks, but untested deadlock detection is itself a potential source of production failures.
- Move EVALUATION_NO_PROGRESS_TIMEOUT config parsing outside retry loop (it won't change between attempts)
- Add comprehensive test suite for deadlock detection:
  - test_deadlock_detection_triggers_on_no_progress
  - test_no_deadlock_when_progress_is_made
  - test_timed_out_count_increments_correctly
  - test_error_output_created_for_deadlocked_instances
  - test_pending_set_cleared_after_deadlock
  - test_timeout_from_env_var
  - test_default_timeout
- Add comment explaining why force termination is necessary (zombie workers won't respond to graceful shutdown)

Co-authored-by: openhands <openhands@all-hands.dev>
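Two of the requested tests could be sketched against a minimal stand-in for the detection check (`is_deadlocked` is a hypothetical helper for illustration, not the evaluation.py implementation):

```python
import time

def is_deadlocked(last_progress_time, no_progress_timeout, now=None):
    """True once no future has completed within no_progress_timeout seconds."""
    if now is None:
        now = time.monotonic()
    return (now - last_progress_time) > no_progress_timeout

def test_deadlock_detection_triggers_on_no_progress():
    # 31 minutes with no progress against a 30-minute timeout -> deadlock
    assert is_deadlocked(last_progress_time=0.0, no_progress_timeout=1800, now=1860.0)

def test_no_deadlock_when_progress_is_made():
    # last progress 10 seconds ago -> not a deadlock
    assert not is_deadlocked(last_progress_time=1850.0, no_progress_timeout=1800, now=1860.0)
```

Passing `now` explicitly keeps the tests deterministic instead of sleeping through real wall-clock time.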
all-hands-bot
left a comment
🟡 Acceptable - Solves real problem, but tests fundamentally flawed
The deadlock detection logic is sound and addresses a real production issue. However, the test suite has a critical flaw: it reimplements the detection logic instead of testing the actual evaluation.py code.
…ethod
- Add try/except for EVALUATION_NO_PROGRESS_TIMEOUT parsing with fallback
- Extract _create_error_output_with_metadata helper to remove duplication
- Add test for invalid env var handling
- Add docstring note explaining test approach (unit tests verify logic patterns)

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Pragmatic solution, but tests are smoke and mirrors
Analysis: The deadlock detection logic is sound and solves a real production problem. Helper method eliminates duplication nicely. Force-terminate approach is pragmatic. BUT: the test suite is fundamentally flawed - it reimplements the logic patterns instead of testing the actual evaluation.py code.
Verdict: ✅ Worth merging IF you accept the test coverage gap, ❌ Needs rework if you want real test coverage
Key Insight: You wrote 288 lines of tests that would still pass if someone deleted the deadlock detection from evaluation.py.
- Fixed style inconsistency: now increment timed_out_count per-instance in deadlock detection loop for consistency with per-instance timeout handling above
- Added TestEvaluatorDeadlockIntegration class with real integration tests that import and test the actual Evaluator class
- Updated module docstring to describe both unit and integration tests

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Pragmatic solution, but tests provide false confidence
Analysis: The deadlock detection logic is sound and addresses a real production problem. Helper method eliminates duplication. Force-terminate approach is pragmatic when workers are zombies.
Key Issue: Integration tests claim to verify real behavior but only assert the code doesn't hang. They don't verify error outputs are created, timed_out_count is incremented, or pending is cleared - giving false confidence about test coverage.
…ions
- Added explicit logging when pending_info is None (indicates bookkeeping bug)
- Changed from silent skip to log + continue pattern for both timeout and deadlock
- Improved integration test to verify error outputs are created with proper content
- Test now asserts: 2 error outputs, 'deadlock' in error message, test_result is None

Co-authored-by: openhands <openhands@all-hands.dev>
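The log + continue pattern might look like this sketch (`drain_stuck_futures` and its arguments are illustrative names, not the actual evaluation.py code):

```python
import logging

logger = logging.getLogger(__name__)

def drain_stuck_futures(stuck, pending, make_error_output):
    """Create an error output for each stuck future from its bookkeeping
    record.  A missing record indicates a bookkeeping bug (every submitted
    future should have one), so log it loudly and continue rather than
    skipping silently."""
    outputs = []
    for future in stuck:
        info = pending.pop(future, None)
        if info is None:
            logger.error("No pending_info for %r; possible bookkeeping bug", future)
            continue
        outputs.append(make_error_output(info))
    return outputs
```

The point of logging instead of silently skipping is that a missing record means the invariant broke somewhere upstream, which is worth surfacing even though the loop can safely proceed.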
all-hands-bot
left a comment
🟡 Acceptable - Solves real problem, but tests fundamentally flawed
The deadlock detection logic is sound and addresses a real production issue. However, the test suite has a critical flaw: it reimplements the detection logic instead of testing the actual evaluation.py code.
- Renamed TestEvaluatorDeadlockIntegration -> TestDeadlockPatterns
- Updated docstrings to clearly state tests validate the pattern/algorithm
- Clarified that these are NOT integration tests of the Evaluator class
- Updated module docstring to match

The tests verify the deadlock detection algorithm works correctly using the same pattern as evaluation.py, but don't require the full Evaluator infrastructure (datasets, models, runtimes).

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Fix type annotation: datapoint_id parameter in _create_error_output_with_metadata should be UUID | None to match LaminarEvalMetadata and PendingInstanceInfo
- Remove unused imports in test_deadlock_detection.py: dataclass, Any, MagicMock, patch

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
🟡 Acceptable - Solves real problem with sound logic. Most previous concerns addressed. Main issue: tests validate the pattern/algorithm but don't test the actual Evaluator implementation in evaluation.py (extensively discussed in previous reviews, accepted as trade-off).
Add defensive check for zero or negative EVALUATION_NO_PROGRESS_TIMEOUT values. If someone sets the env var to 0 or a negative number, deadlock detection would fire immediately. Now logs a warning and falls back to the default 1800s.

Co-authored-by: openhands <openhands@all-hands.dev>
…arify test scope

Changes:
- Add validation for zero/negative EVALUATION_NO_PROGRESS_TIMEOUT values that fall back to default 1800s (prevents immediate deadlock detection)
- Update docstring to clearly acknowledge that tests validate the algorithm/pattern, not the actual Evaluator class integration
- Add tests for zero and negative timeout validation

Co-authored-by: openhands <openhands@all-hands.dev>
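The validation described above might be sketched like this (the helper name and env handling are illustrative; the actual evaluation.py code may differ):

```python
import logging
import os

logger = logging.getLogger(__name__)
DEFAULT_NO_PROGRESS_TIMEOUT = 1800  # seconds (30 minutes)

def validated_no_progress_timeout(env=None):
    """Parse EVALUATION_NO_PROGRESS_TIMEOUT, falling back to the default
    for unparseable, zero, or negative values -- a non-positive timeout
    would make deadlock detection fire immediately."""
    env = os.environ if env is None else env
    raw = env.get("EVALUATION_NO_PROGRESS_TIMEOUT")
    if raw is None:
        return DEFAULT_NO_PROGRESS_TIMEOUT
    try:
        timeout = float(raw)
    except ValueError:
        logger.warning("Invalid EVALUATION_NO_PROGRESS_TIMEOUT=%r; using default %ss",
                       raw, DEFAULT_NO_PROGRESS_TIMEOUT)
        return DEFAULT_NO_PROGRESS_TIMEOUT
    if timeout <= 0:
        logger.warning("EVALUATION_NO_PROGRESS_TIMEOUT=%r is not positive; using default %ss",
                       raw, DEFAULT_NO_PROGRESS_TIMEOUT)
        return DEFAULT_NO_PROGRESS_TIMEOUT
    return timeout
```

Accepting an `env` mapping rather than reading `os.environ` directly keeps the zero/negative/invalid cases trivially unit-testable.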
all-hands-bot
left a comment
🟡 Acceptable - Pragmatic solution to real production problem. Logic is sound, edge cases handled. One minor doc fix needed.
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
Adds deadlock detection at the orchestrator level to catch when all workers are stuck and making no progress.
Problem
Even with per-instance timeouts, the evaluation job can deadlock if:
- pool.shutdown(wait=True) waits forever for zombie workers

Solution
- No-Progress Timeout: Track last_progress_time for any completed or timed-out future. If no progress for 30 minutes (configurable via EVALUATION_NO_PROGRESS_TIMEOUT), log DEADLOCK DETECTED and terminate all pending instances.
- Force Terminate on Timeout: When any instances timed out, use _cleanup_pool(wait=False) instead of pool.shutdown(wait=True) to forcefully terminate zombie workers.

New Log Messages
Environment Variables
- EVALUATION_NO_PROGRESS_TIMEOUT - Seconds without progress before assuming deadlock (default: 1800 = 30 minutes)

Changes
- benchmarks/utils/evaluation.py: Add progress tracking and deadlock detection logic

Evidence
Cannot be tested in current environment:
- Set EVALUATION_NO_PROGRESS_TIMEOUT=300 (5 minutes) for faster testing
- Verify DEADLOCK DETECTED appears in logs when workers hang