
Add deadlock detection with no-progress timeout #458

Draft

neubig wants to merge 11 commits into main from add-deadlock-detection

Conversation

@neubig (Contributor) commented Feb 26, 2026

Summary

Adds deadlock detection at the orchestrator level to catch when all workers are stuck and making no progress.

Problem

Even with per-instance timeouts, the evaluation job can deadlock if:

  1. Instance timeouts fire but workers are stuck in blocking I/O
  2. pool.shutdown(wait=True) waits forever for zombie workers
  3. No futures complete for extended periods

Solution

  1. No-Progress Timeout: Update last_progress_time whenever any future completes or times out. If there is no progress for 30 minutes (configurable via EVALUATION_NO_PROGRESS_TIMEOUT), log DEADLOCK DETECTED and terminate all pending instances.

  2. Force Terminate on Timeout: When any instance has timed out, use _cleanup_pool(wait=False) instead of pool.shutdown(wait=True) to forcefully terminate zombie workers.
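The two mechanisms above can be sketched roughly as follows. This is a minimal illustration of the pattern, not the actual evaluation.py code; `run_all`, `worker_fn`, and the pool setup are hypothetical names.

```python
import concurrent.futures as cf
import os
import time

# Hypothetical sketch of the no-progress watchdog described above.
NO_PROGRESS_TIMEOUT = int(os.environ.get("EVALUATION_NO_PROGRESS_TIMEOUT", "1800"))

def run_all(instances, worker_fn, max_workers=4):
    pool = cf.ThreadPoolExecutor(max_workers=max_workers)
    pending = {pool.submit(worker_fn, inst): inst for inst in instances}
    last_progress_time = time.monotonic()
    results = []
    while pending:
        done, _ = cf.wait(pending, timeout=60, return_when=cf.FIRST_COMPLETED)
        if done:
            # Any completed (or timed-out) future counts as progress.
            last_progress_time = time.monotonic()
            for fut in done:
                inst = pending.pop(fut)
                results.append((inst, fut.result()))
        elif time.monotonic() - last_progress_time > NO_PROGRESS_TIMEOUT:
            print(f"DEADLOCK DETECTED: No progress for "
                  f"{NO_PROGRESS_TIMEOUT // 60} minutes with "
                  f"{len(pending)} pending instances. "
                  f"Force terminating stuck workers.")
            for fut in pending:
                fut.cancel()
            pending.clear()
            # Don't wait for zombie workers stuck in blocking I/O.
            pool.shutdown(wait=False)
            break
    else:
        pool.shutdown(wait=True)
    return results
```

The key design choice is that both completed and timed-out futures reset the progress clock, so the watchdog only fires when literally nothing is moving.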

New Log Messages

DEADLOCK DETECTED: No progress for X minutes with N pending instances. Force terminating stuck workers.
N instances timed out or deadlocked. Force terminating zombie workers.

Environment Variables

  • EVALUATION_NO_PROGRESS_TIMEOUT - Seconds without progress before assuming deadlock (default: 1800 = 30 minutes)
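Later commits in this PR add validation around parsing this variable (fallback on unparsable, zero, or negative values). A sketch of that parse-with-fallback pattern, with a hypothetical function name:

```python
import logging
import os

DEFAULT_NO_PROGRESS_TIMEOUT = 1800  # 30 minutes

def read_no_progress_timeout() -> int:
    """Read EVALUATION_NO_PROGRESS_TIMEOUT, falling back to the default on
    unset, unparsable, or non-positive values (a non-positive timeout would
    make deadlock detection fire immediately)."""
    raw = os.environ.get("EVALUATION_NO_PROGRESS_TIMEOUT")
    if raw is None:
        return DEFAULT_NO_PROGRESS_TIMEOUT
    try:
        value = int(raw)
    except ValueError:
        logging.warning("Invalid EVALUATION_NO_PROGRESS_TIMEOUT=%r; "
                        "using default %ds", raw, DEFAULT_NO_PROGRESS_TIMEOUT)
        return DEFAULT_NO_PROGRESS_TIMEOUT
    if value <= 0:
        logging.warning("EVALUATION_NO_PROGRESS_TIMEOUT=%d is not positive; "
                        "using default %ds", value, DEFAULT_NO_PROGRESS_TIMEOUT)
        return DEFAULT_NO_PROGRESS_TIMEOUT
    return value
```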

Changes

  • benchmarks/utils/evaluation.py: Add progress tracking and deadlock detection logic

Evidence

Cannot be tested in current environment:

  • Resource needed: Real evaluation environment with workloads that could deadlock
  • Reason: Deadlock detection requires actual stuck workers to trigger
  • Manual verification steps:
    1. Run an evaluation with a known-deadlocking dataset/configuration
    2. Set EVALUATION_NO_PROGRESS_TIMEOUT=300 (5 minutes) for faster testing
    3. Verify DEADLOCK DETECTED appears in logs when workers hang
    4. Verify error outputs are created for deadlocked instances
    5. Verify pool shuts down (doesn't hang forever)

Detects when all workers are stuck by tracking time since last progress.
If no futures complete or timeout for 30 minutes (configurable via
EVALUATION_NO_PROGRESS_TIMEOUT), assumes deadlock and force terminates.

Also force terminates zombie workers on pool shutdown when any instances
timed out, since those workers are stuck in blocking I/O and will never
return normally.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 1, 2026 13:24
@all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable - solves real problem, needs tests

Analysis: The logic is sound - track progress, detect deadlock, force terminate. Pragmatic solution to a real production problem. Main issue: no tests for this critical new behavior.

Verdict: ✅ Worth merging after tests added

Key Insight: Good defensive programming for worker deadlocks, but untested deadlock detection is itself a potential source of production failures.

@neubig neubig marked this pull request as draft March 1, 2026 13:28
- Move EVALUATION_NO_PROGRESS_TIMEOUT config parsing outside retry loop
  (it won't change between attempts)
- Add comprehensive test suite for deadlock detection:
  - test_deadlock_detection_triggers_on_no_progress
  - test_no_deadlock_when_progress_is_made
  - test_timed_out_count_increments_correctly
  - test_error_output_created_for_deadlocked_instances
  - test_pending_set_cleared_after_deadlock
  - test_timeout_from_env_var
  - test_default_timeout
- Add comment explaining why force termination is necessary
  (zombie workers won't respond to graceful shutdown)

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 1, 2026 13:52
@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Solves real problem, but tests fundamentally flawed

The deadlock detection logic is sound and addresses a real production issue. However, the test suite has a critical flaw: it reimplements the detection logic instead of testing the actual evaluation.py code.

@neubig neubig marked this pull request as draft March 1, 2026 14:13
…ethod

- Add try/except for EVALUATION_NO_PROGRESS_TIMEOUT parsing with fallback
- Extract _create_error_output_with_metadata helper to remove duplication
- Add test for invalid env var handling
- Add docstring note explaining test approach (unit tests verify logic patterns)

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 1, 2026 14:19
@all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable - Pragmatic solution, but tests are smoke and mirrors

Analysis: The deadlock detection logic is sound and solves a real production problem. Helper method eliminates duplication nicely. Force-terminate approach is pragmatic. BUT: the test suite is fundamentally flawed - it reimplements the logic patterns instead of testing the actual evaluation.py code.

Verdict: ✅ Worth merging IF you accept the test coverage gap, ❌ Needs rework if you want real test coverage

Key Insight: You wrote 288 lines of tests that would still pass if someone deleted the deadlock detection from evaluation.py.

@neubig neubig marked this pull request as draft March 1, 2026 21:25
- Fixed style inconsistency: now increment timed_out_count per-instance
  in deadlock detection loop for consistency with per-instance timeout
  handling above
- Added TestEvaluatorDeadlockIntegration class with real integration tests
  that import and test the actual Evaluator class
- Updated module docstring to describe both unit and integration tests

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 1, 2026 21:27
@all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable - Pragmatic solution, but tests provide false confidence

Analysis: The deadlock detection logic is sound and addresses a real production problem. Helper method eliminates duplication. Force-terminate approach is pragmatic when workers are zombies.

Key Issue: Integration tests claim to verify real behavior but only assert the code doesn't hang. They don't verify error outputs are created, timed_out_count is incremented, or pending is cleared - giving false confidence about test coverage.

@neubig neubig marked this pull request as draft March 2, 2026 01:52
…ions

- Added explicit logging when pending_info is None (indicates bookkeeping bug)
- Changed from silent skip to log + continue pattern for both timeout and deadlock
- Improved integration test to verify error outputs are created with proper content
- Test now asserts: 2 error outputs, 'deadlock' in error message, test_result is None

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 2, 2026 01:58
@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Solves real problem, but tests fundamentally flawed

The deadlock detection logic is sound and addresses a real production issue. However, the test suite has a critical flaw: it reimplements the detection logic instead of testing the actual evaluation.py code.

@neubig neubig marked this pull request as draft March 2, 2026 02:24
neubig and others added 3 commits March 2, 2026 02:29
- Renamed TestEvaluatorDeadlockIntegration -> TestDeadlockPatterns
- Updated docstrings to clearly state tests validate the pattern/algorithm
- Clarified that these are NOT integration tests of the Evaluator class
- Updated module docstring to match

The tests verify the deadlock detection algorithm works correctly using
the same pattern as evaluation.py, but don't require the full Evaluator
infrastructure (datasets, models, runtimes).

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Fix type annotation: datapoint_id parameter in _create_error_output_with_metadata
  should be UUID | None to match LaminarEvalMetadata and PendingInstanceInfo
- Remove unused imports in test_deadlock_detection.py: dataclass, Any, MagicMock, patch

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 2, 2026 14:10
@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Solves real problem with sound logic. Most previous concerns addressed. Main issue: tests validate the pattern/algorithm but don't test the actual Evaluator implementation in evaluation.py (extensively discussed in previous reviews, accepted as trade-off).

@neubig neubig marked this pull request as draft March 2, 2026 14:41
Add defensive check for zero or negative EVALUATION_NO_PROGRESS_TIMEOUT values.
If someone sets the env var to 0 or a negative number, deadlock detection would
fire immediately. Now logs a warning and falls back to the default 1800s.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review March 2, 2026 14:56
…arify test scope

Changes:
- Add validation for zero/negative EVALUATION_NO_PROGRESS_TIMEOUT values
  that fall back to default 1800s (prevents immediate deadlock detection)
- Update docstring to clearly acknowledge that tests validate the
  algorithm/pattern, not the actual Evaluator class integration
- Add tests for zero and negative timeout validation

Co-authored-by: openhands <openhands@all-hands.dev>
@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Pragmatic solution to real production problem. Logic is sound, edge cases handled. One minor doc fix needed.

@neubig neubig marked this pull request as draft March 9, 2026 03:07
