Skip to content

COMP: Stall-based watchdog for ParallelSparseFieldLevelSet robustness tests#6530

Open
hjmjohnson wants to merge 1 commit into
InsightSoftwareConsortium:mainfrom
hjmjohnson:fix-pslsif-robustness-stall-watchdog
Open

COMP: Stall-based watchdog for ParallelSparseFieldLevelSet robustness tests#6530
hjmjohnson wants to merge 1 commit into
InsightSoftwareConsortium:mainfrom
hjmjohnson:fix-pslsif-robustness-stall-watchdog

Conversation

@hjmjohnson

Copy link
Copy Markdown
Member

Fixes the nightly itk-valgrind failures of ParallelSparseFieldLevelSetRobustness.SweepRepeat and .ConcurrentMultiPipeline (tracked in #6518). The tests' own watchdog used a fixed 300 s wall-clock deadline that cannot distinguish a real dispatch deadlock from slow-but-progressing execution; under valgrind these solves legitimately run past 300 s and the watchdog std::abort()-ed them. The watchdog now fires on absence of forward progress instead.

Root cause

The three robustness GTests wrap their body in a watchdog thread meant to turn a ParallelizeArray dispatch deadlock into a clean failure rather than a hung driver. A total-elapsed-time budget is the wrong signal: valgrind/TSan/Debug runs are slow but make steady progress, so the fixed 300 s deadline false-aborts them. CDash showed this deterministically every night — SweepRepeat ~304 s and ConcurrentMultiPipeline ~310–312 s against the 300 s deadline. This is the second time a fixed number was outgrown (an earlier change had already raised it 30 s → 300 s).

The fix

Drive the watchdog off a heartbeat counter that advances with solver progress, and abort only when the heartbeat does not advance for the stall window. An itk::IterationEvent observer attached to the filter in run_one bumps the heartbeat once per level-set iteration. A real deadlock freezes the heartbeat (no iterations complete) and still aborts promptly; valgrind/TSan/Debug runs keep it advancing regardless of total runtime. The per-iteration granularity is what makes ConcurrentMultiPipeline robust under valgrind's thread serialization, where no whole-pipeline future completes for minutes.

Validation

Built and ran locally before pushing (per local-test-first discipline).

Check Result
Native ctest -R ParallelSparseFieldLevelSetRobustness 3/3 pass (3.9 s)
valgrind memcheck ConcurrentMultiPipeline PASS @ 1002 s (was abort @ 300 s)
valgrind memcheck SweepRepeat PASS @ 672 s (was abort @ 300 s)
pre-commit run --all-files clean

An intermediate version that bumped the heartbeat only at whole-pipeline completion false-aborted ConcurrentMultiPipeline at 123 s under valgrind; the per-iteration observer fixes that and was re-validated above.

… tests

The robustness GTests aborted on a fixed 300 s wall-clock deadline that cannot
tell a deadlock from slow progress, so valgrind runs failed the itk-valgrind
nightly. Drive the watchdog off forward progress instead: an IterationEvent
observer in run_one advances a heartbeat, and the watchdog aborts only when the
heartbeat stalls. A real deadlock freezes it; slow valgrind/TSan/Debug runs keep
it advancing.
@github-actions github-actions Bot added type:Compiler Compiler support or related warnings type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct area:Segmentation Issues affecting the Segmentation module labels Jun 29, 2026
@hjmjohnson hjmjohnson marked this pull request as ready for review June 29, 2026 16:48
@hjmjohnson

Copy link
Copy Markdown
Member Author

NOTE: We could have just increased the 300s to some huge number, but then an early deadlock would take a long time to report failure. This is a little more fine-grained. An early deadlock stops in 120 seconds after the deadlock. But allows completion for very long-running times, as long as each iteration takes less than 120 seconds.

@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR changes the robustness tests to use progress-based watchdogs. The main changes are:

  • Added a heartbeat counter shared with the watchdog thread.
  • Added an IterationEvent observer that bumps the heartbeat during solver iterations.
  • Updated the sequential and concurrent robustness tests to pass the heartbeat into each solver run.

Confidence Score: 5/5

This looks safe to merge after considering the watchdog edge case.

  • The heartbeat lifetime and concurrent sharing are safe in the changed tests.
  • The remaining issue is a conditional test-watchdog false abort during long non-iteration work.

Modules/Segmentation/LevelSets/test/itkParallelSparseFieldLevelSetImageFilterRobustnessGTest.cxx

Important Files Changed

Filename Overview
Modules/Segmentation/LevelSets/test/itkParallelSparseFieldLevelSetImageFilterRobustnessGTest.cxx Replaces a fixed test deadline with a heartbeat watchdog driven by solver iteration events.

Reviews (1): Last reviewed commit: "COMP: Stall-based watchdog for ParallelS..." | Re-trigger Greptile

if (heartbeat != nullptr)
{
auto cmd = HeartbeatCommand::New();
cmd->m_Heartbeat = heartbeat;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-Iteration Work Looks Stalled

The heartbeat only advances on IterationEvent, but each run_one still has setup before the first event and cleanup after the last event. If a sanitizer or valgrind run spends more than 120 seconds in one of those non-iteration phases, the watchdog aborts a still-progressing test as a dispatch deadlock.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Segmentation Issues affecting the Segmentation module type:Compiler Compiler support or related warnings type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants