Add deadlock detection with no-progress timeout #458
Conversation
Detects when all workers are stuck by tracking the time since the last progress. If no futures complete or time out for 30 minutes (configurable via EVALUATION_NO_PROGRESS_TIMEOUT), assumes deadlock and force terminates. Also force terminates zombie workers on pool shutdown when any instances timed out, since those workers are stuck in blocking I/O and will never return normally.

Co-authored-by: openhands <openhands@all-hands.dev>
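The mechanism could be sketched roughly as follows. All names here are illustrative, not the actual evaluation.py API; the real implementation differs:

```python
import concurrent.futures as cf
import os
import time

# Seconds without any completed future before assuming deadlock.
DEFAULT_NO_PROGRESS_TIMEOUT = float(
    os.environ.get("EVALUATION_NO_PROGRESS_TIMEOUT", 1800)
)

def run_with_deadlock_detection(
    pool, futures, handle_result, handle_deadlock,
    no_progress_timeout=DEFAULT_NO_PROGRESS_TIMEOUT, poll_interval=30.0,
):
    """Poll futures, tracking the time of last progress.  If nothing
    completes for no_progress_timeout seconds, treat it as a deadlock:
    hand the stuck futures to handle_deadlock and force-shut the pool,
    since zombie workers blocked on I/O will never return normally."""
    pending = set(futures)
    last_progress = time.monotonic()
    while pending:
        done, pending = cf.wait(
            pending, timeout=poll_interval, return_when=cf.FIRST_COMPLETED
        )
        if done:
            last_progress = time.monotonic()
            for fut in done:
                handle_result(fut)
        elif time.monotonic() - last_progress > no_progress_timeout:
            # Workers will never return; write error outputs and give up.
            handle_deadlock(pending)
            pending = set()
    pool.shutdown(wait=False)  # don't wait for zombie workers
```

The key idea is that `last_progress` is only refreshed when a future actually finishes, so a pool full of stuck workers eventually trips the timeout even though each individual wait call keeps returning.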
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - solves real problem, needs tests
Analysis: The logic is sound - track progress, detect deadlock, force terminate. Pragmatic solution to a real production problem. Main issue: no tests for this critical new behavior.
Verdict: ✅ Worth merging after tests added
Key Insight: Good defensive programming for worker deadlocks, but untested deadlock detection is itself a potential source of production failures.
- Move EVALUATION_NO_PROGRESS_TIMEOUT config parsing outside retry loop (it won't change between attempts)
- Add comprehensive test suite for deadlock detection:
  - test_deadlock_detection_triggers_on_no_progress
  - test_no_deadlock_when_progress_is_made
  - test_timed_out_count_increments_correctly
  - test_error_output_created_for_deadlocked_instances
  - test_pending_set_cleared_after_deadlock
  - test_timeout_from_env_var
  - test_default_timeout
- Add comment explaining why force termination is necessary (zombie workers won't respond to graceful shutdown)

Co-authored-by: openhands <openhands@all-hands.dev>
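Two of the requested tests could be sketched against a minimal stand-in for the detection check (`is_deadlocked` is a hypothetical helper for illustration, not the evaluation.py implementation):

```python
import time

def is_deadlocked(last_progress_time, no_progress_timeout, now=None):
    """True once no future has completed within no_progress_timeout seconds."""
    if now is None:
        now = time.monotonic()
    return (now - last_progress_time) > no_progress_timeout

def test_deadlock_detection_triggers_on_no_progress():
    # 31 minutes with no progress against a 30-minute timeout -> deadlock
    assert is_deadlocked(last_progress_time=0.0, no_progress_timeout=1800, now=1860.0)

def test_no_deadlock_when_progress_is_made():
    # last progress 10 seconds ago -> not a deadlock
    assert not is_deadlocked(last_progress_time=1850.0, no_progress_timeout=1800, now=1860.0)
```

Passing `now` explicitly keeps the tests deterministic instead of sleeping through real wall-clock time.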
all-hands-bot
left a comment
🟡 Acceptable - Solves real problem, but tests fundamentally flawed
The deadlock detection logic is sound and addresses a real production issue. However, the test suite has a critical flaw: it reimplements the detection logic instead of testing the actual evaluation.py code.
…ethod
- Add try/except for EVALUATION_NO_PROGRESS_TIMEOUT parsing with fallback
- Extract _create_error_output_with_metadata helper to remove duplication
- Add test for invalid env var handling
- Add docstring note explaining test approach (unit tests verify logic patterns)

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Pragmatic solution, but tests are smoke and mirrors
Analysis: The deadlock detection logic is sound and solves a real production problem. Helper method eliminates duplication nicely. Force-terminate approach is pragmatic. BUT: the test suite is fundamentally flawed - it reimplements the logic patterns instead of testing the actual evaluation.py code.
Verdict: ✅ Worth merging IF you accept the test coverage gap, ❌ Needs rework if you want real test coverage
Key Insight: You wrote 288 lines of tests that would still pass if someone deleted the deadlock detection from evaluation.py.
- Fixed style inconsistency: now increment timed_out_count per-instance in deadlock detection loop for consistency with per-instance timeout handling above
- Added TestEvaluatorDeadlockIntegration class with real integration tests that import and test the actual Evaluator class
- Updated module docstring to describe both unit and integration tests

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Pragmatic solution, but tests provide false confidence
Analysis: The deadlock detection logic is sound and addresses a real production problem. Helper method eliminates duplication. Force-terminate approach is pragmatic when workers are zombies.
Key Issue: Integration tests claim to verify real behavior but only assert the code doesn't hang. They don't verify error outputs are created, timed_out_count is incremented, or pending is cleared - giving false confidence about test coverage.
…ions
- Added explicit logging when pending_info is None (indicates bookkeeping bug)
- Changed from silent skip to log + continue pattern for both timeout and deadlock
- Improved integration test to verify error outputs are created with proper content
- Test now asserts: 2 error outputs, 'deadlock' in error message, test_result is None

Co-authored-by: openhands <openhands@all-hands.dev>
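The log + continue pattern might look like this sketch (`drain_stuck_futures` and its arguments are illustrative names, not the actual evaluation.py code):

```python
import logging

logger = logging.getLogger(__name__)

def drain_stuck_futures(stuck, pending, make_error_output):
    """Create an error output for each stuck future from its bookkeeping
    record.  A missing record indicates a bookkeeping bug (every submitted
    future should have one), so log it loudly and continue rather than
    skipping silently."""
    outputs = []
    for future in stuck:
        info = pending.pop(future, None)
        if info is None:
            logger.error("No pending_info for %r; possible bookkeeping bug", future)
            continue
        outputs.append(make_error_output(info))
    return outputs
```

The point of logging instead of silently skipping is that a missing record means the invariant broke somewhere upstream, which is worth surfacing even though the loop can safely proceed.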
all-hands-bot
left a comment
🟡 Acceptable - Solves real problem, but tests fundamentally flawed
The deadlock detection logic is sound and addresses a real production issue. However, the test suite has a critical flaw: it reimplements the detection logic instead of testing the actual evaluation.py code.
- Renamed TestEvaluatorDeadlockIntegration -> TestDeadlockPatterns
- Updated docstrings to clearly state tests validate the pattern/algorithm
- Clarified that these are NOT integration tests of the Evaluator class
- Updated module docstring to match

The tests verify the deadlock detection algorithm works correctly using the same pattern as evaluation.py, but don't require the full Evaluator infrastructure (datasets, models, runtimes).

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Fix type annotation: datapoint_id parameter in _create_error_output_with_metadata should be UUID | None to match LaminarEvalMetadata and PendingInstanceInfo
- Remove unused imports in test_deadlock_detection.py: dataclass, Any, MagicMock, patch

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
🟡 Acceptable - Solves real problem with sound logic. Most previous concerns addressed. Main issue: tests validate the pattern/algorithm but don't test the actual Evaluator implementation in evaluation.py (extensively discussed in previous reviews, accepted as trade-off).
Add defensive check for zero or negative EVALUATION_NO_PROGRESS_TIMEOUT values. If someone sets the env var to 0 or a negative number, deadlock detection would fire immediately. Now logs a warning and falls back to the default 1800s.

Co-authored-by: openhands <openhands@all-hands.dev>
…arify test scope

Changes:
- Add validation for zero/negative EVALUATION_NO_PROGRESS_TIMEOUT values that fall back to default 1800s (prevents immediate deadlock detection)
- Update docstring to clearly acknowledge that tests validate the algorithm/pattern, not the actual Evaluator class integration
- Add tests for zero and negative timeout validation

Co-authored-by: openhands <openhands@all-hands.dev>
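The validation described above might be sketched like this (the helper name and env handling are illustrative; the actual evaluation.py code may differ):

```python
import logging
import os

logger = logging.getLogger(__name__)
DEFAULT_NO_PROGRESS_TIMEOUT = 1800  # seconds (30 minutes)

def validated_no_progress_timeout(env=None):
    """Parse EVALUATION_NO_PROGRESS_TIMEOUT, falling back to the default
    for unparseable, zero, or negative values -- a non-positive timeout
    would make deadlock detection fire immediately."""
    env = os.environ if env is None else env
    raw = env.get("EVALUATION_NO_PROGRESS_TIMEOUT")
    if raw is None:
        return DEFAULT_NO_PROGRESS_TIMEOUT
    try:
        timeout = float(raw)
    except ValueError:
        logger.warning("Invalid EVALUATION_NO_PROGRESS_TIMEOUT=%r; using default %ss",
                       raw, DEFAULT_NO_PROGRESS_TIMEOUT)
        return DEFAULT_NO_PROGRESS_TIMEOUT
    if timeout <= 0:
        logger.warning("EVALUATION_NO_PROGRESS_TIMEOUT=%r is not positive; using default %ss",
                       raw, DEFAULT_NO_PROGRESS_TIMEOUT)
        return DEFAULT_NO_PROGRESS_TIMEOUT
    return timeout
```

Accepting an `env` mapping rather than reading `os.environ` directly keeps the zero/negative/invalid cases trivially unit-testable.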
all-hands-bot
left a comment
🟡 Acceptable - Pragmatic solution to real production problem. Logic is sound, edge cases handled. One minor doc fix needed.
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
Adds deadlock detection at the orchestrator level to catch when all workers are stuck and making no progress.
Problem
Even with per-instance timeouts, the evaluation job can deadlock if:
- pool.shutdown(wait=True) waits forever for zombie workers

Solution
- No-Progress Timeout: Track last_progress_time for any completed or timed-out future. If no progress for 30 minutes (configurable via EVALUATION_NO_PROGRESS_TIMEOUT), log DEADLOCK DETECTED and terminate all pending instances.
- Force Terminate on Timeout: When any instances timed out, use _cleanup_pool(wait=False) instead of pool.shutdown(wait=True) to forcefully terminate zombie workers.

New Log Messages
Environment Variables
- EVALUATION_NO_PROGRESS_TIMEOUT - Seconds without progress before assuming deadlock (default: 1800 = 30 minutes)

Changes
- benchmarks/utils/evaluation.py: Add progress tracking and deadlock detection logic

Evidence
Cannot be tested in current environment:
- Set EVALUATION_NO_PROGRESS_TIMEOUT=300 (5 minutes) for faster testing
- Verify DEADLOCK DETECTED appears in logs when workers hang