COMP: Stall-based watchdog for ParallelSparseFieldLevelSet robustness tests by hjmjohnson · Pull Request #6530 · InsightSoftwareConsortium/ITK

hjmjohnson · 2026-06-29T16:45:35Z

Fixes the nightly itk-valgrind failures of ParallelSparseFieldLevelSetRobustness.SweepRepeat and .ConcurrentMultiPipeline (tracked in #6518). The tests' own watchdog used a fixed 300 s wall-clock deadline that cannot distinguish a real dispatch deadlock from slow-but-progressing execution; under valgrind these solves legitimately run past 300 s and the watchdog std::abort()-ed them. The watchdog now fires on absence of forward progress instead.

Root cause

The three robustness GTests wrap their body in a watchdog thread meant to turn a ParallelizeArray dispatch deadlock into a clean failure rather than a hung driver. A total-elapsed-time budget is the wrong signal: valgrind/TSan/Debug runs are slow but make steady progress, so the fixed 300 s deadline false-aborts them. CDash showed this deterministically every night — SweepRepeat ~304 s and ConcurrentMultiPipeline ~310–312 s against the 300 s deadline. This is the second time a fixed number was outgrown (an earlier change had already raised it 30 s → 300 s).

The fix

Drive the watchdog off a heartbeat counter that advances with solver progress, and abort only when the heartbeat does not advance for the stall window. An itk::IterationEvent observer attached to the filter in run_one bumps the heartbeat once per level-set iteration. A real deadlock freezes the heartbeat (no iterations complete) and still aborts promptly; valgrind/TSan/Debug runs keep it advancing regardless of total runtime. The per-iteration granularity is what makes ConcurrentMultiPipeline robust under valgrind's thread serialization, where no whole-pipeline future completes for minutes.

Validation

Built and ran locally before pushing (per local-test-first discipline).

Check	Result
Native `ctest -R ParallelSparseFieldLevelSetRobustness`	3/3 pass (3.9 s)
valgrind memcheck `ConcurrentMultiPipeline`	PASS @ 1002 s (was abort @ 300 s)
valgrind memcheck `SweepRepeat`	PASS @ 672 s (was abort @ 300 s)
`pre-commit run --all-files`	clean

An intermediate version that bumped the heartbeat only at whole-pipeline completion false-aborted ConcurrentMultiPipeline at 123 s under valgrind; the per-iteration observer fixes that and was re-validated above.

… tests The robustness GTests aborted on a fixed 300 s wall-clock deadline that cannot tell a deadlock from slow progress, so valgrind runs failed the itk-valgrind nightly. Drive the watchdog off forward progress instead: an IterationEvent observer in run_one advances a heartbeat, and the watchdog aborts only when the heartbeat stalls. A real deadlock freezes it; slow valgrind/TSan/Debug runs keep it advancing.

hjmjohnson · 2026-06-29T16:52:04Z

NOTE: We could have just increased the 300s to some huge number, but then an early deadlock would take a long time to report failure. This is a little more fine-grained. An early deadlock stops in 120 seconds after the deadlock. But allows completion for very long-running times, as long as each iteration takes less than 120 seconds.

greptile-apps · 2026-06-29T16:52:19Z

Greptile Summary

This PR changes the robustness tests to use progress-based watchdogs. The main changes are:

Added a heartbeat counter shared with the watchdog thread.
Added an IterationEvent observer that bumps the heartbeat during solver iterations.
Updated the sequential and concurrent robustness tests to pass the heartbeat into each solver run.

Confidence Score: 5/5

This looks safe to merge after considering the watchdog edge case.

The heartbeat lifetime and concurrent sharing are safe in the changed tests.
The remaining issue is a conditional test-watchdog false abort during long non-iteration work.

Modules/Segmentation/LevelSets/test/itkParallelSparseFieldLevelSetImageFilterRobustnessGTest.cxx

Important Files Changed

Filename	Overview
Modules/Segmentation/LevelSets/test/itkParallelSparseFieldLevelSetImageFilterRobustnessGTest.cxx	Replaces a fixed test deadline with a heartbeat watchdog driven by solver iteration events.

_{Reviews (1): Last reviewed commit: "COMP: Stall-based watchdog for ParallelS..." | Re-trigger Greptile}

greptile-apps · 2026-06-29T16:52:23Z

+  if (heartbeat != nullptr)
+  {
+    auto cmd = HeartbeatCommand::New();
+    cmd->m_Heartbeat = heartbeat;


Non-Iteration Work Looks Stalled

The heartbeat only advances on IterationEvent, but each run_one still has setup before the first event and cleanup after the last event. If a sanitizer or valgrind run spends more than 120 seconds in one of those non-iteration phases, the watchdog aborts a still-progressing test as a dispatch deadlock.

github-actions Bot added type:Compiler Compiler support or related warnings type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct area:Segmentation Issues affecting the Segmentation module labels Jun 29, 2026

hjmjohnson mentioned this pull request Jun 29, 2026

COMP: Regenerate ShapeLabelMap baselines at full double precision #6521

Open

hjmjohnson marked this pull request as ready for review June 29, 2026 16:48

greptile-apps Bot reviewed Jun 29, 2026

View reviewed changes

dzenanz approved these changes Jun 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

COMP: Stall-based watchdog for ParallelSparseFieldLevelSet robustness tests#6530

COMP: Stall-based watchdog for ParallelSparseFieldLevelSet robustness tests#6530
hjmjohnson wants to merge 1 commit into
InsightSoftwareConsortium:mainfrom
hjmjohnson:fix-pslsif-robustness-stall-watchdog

hjmjohnson commented Jun 29, 2026

Uh oh!

hjmjohnson commented Jun 29, 2026

Uh oh!

greptile-apps Bot commented Jun 29, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

greptile-apps Bot Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

hjmjohnson commented Jun 29, 2026

Uh oh!

hjmjohnson commented Jun 29, 2026

Uh oh!

greptile-apps Bot commented Jun 29, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

greptile-apps Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants