COMP: Stall-based watchdog for ParallelSparseFieldLevelSet robustness tests #6530
… tests The robustness GTests aborted on a fixed 300 s wall-clock deadline that cannot tell a deadlock from slow progress, so valgrind runs failed the itk-valgrind nightly. Drive the watchdog off forward progress instead: an IterationEvent observer in run_one advances a heartbeat, and the watchdog aborts only when the heartbeat stalls. A real deadlock freezes it; slow valgrind/TSan/Debug runs keep it advancing.
NOTE: We could have simply raised the 300 s limit to some huge number, but then an early deadlock would take a long time to report as a failure. The stall window is finer-grained: a deadlock is reported about 120 seconds after progress stops, while arbitrarily long runs are allowed to complete as long as each iteration takes less than 120 seconds.

| Filename | Overview |
|---|---|
| Modules/Segmentation/LevelSets/test/itkParallelSparseFieldLevelSetImageFilterRobustnessGTest.cxx | Replaces a fixed test deadline with a heartbeat watchdog driven by solver iteration events. |
    if (heartbeat != nullptr)
    {
      auto cmd = HeartbeatCommand::New();
      cmd->m_Heartbeat = heartbeat;
Non-Iteration Work Looks Stalled
The heartbeat only advances on IterationEvent, but each run_one still has setup before the first event and cleanup after the last event. If a sanitizer or valgrind run spends more than 120 seconds in one of those non-iteration phases, the watchdog aborts a still-progressing test as a dispatch deadlock.
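One possible mitigation for the concern above, sketched here under assumptions rather than taken from the PR: bump the heartbeat at phase boundaries as well as on each iteration, so setup and cleanup each get their own stall window. `Beat` and `RunOneSketch` are hypothetical names, and the three callables stand in for the real setup, solve, and cleanup steps of `run_one`. This does not help if a single phase by itself exceeds the stall window.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: advance the heartbeat if the caller supplied one.
inline void
Beat(std::atomic<std::uint64_t> * heartbeat)
{
  if (heartbeat != nullptr)
  {
    heartbeat->fetch_add(1, std::memory_order_relaxed);
  }
}

// Hypothetical stand-in for run_one: the callables represent the real
// setup, solve, and cleanup work. The solve step is expected to bump the
// heartbeat itself (in the PR, through an IterationEvent observer).
template <typename TSetup, typename TSolve, typename TCleanup>
void
RunOneSketch(std::atomic<std::uint64_t> * heartbeat, TSetup setup, TSolve solve, TCleanup cleanup)
{
  Beat(heartbeat); // entering setup
  setup();
  Beat(heartbeat); // setup finished, solve about to start
  solve();
  Beat(heartbeat); // solve finished, cleanup about to start
  cleanup();
  Beat(heartbeat); // cleanup finished
}
```

With this shape the watchdog sees four extra beats per run in addition to the per-iteration ones, and passing `nullptr` keeps the function usable without a watchdog.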
Fixes the nightly `itk-valgrind` failures of `ParallelSparseFieldLevelSetRobustness.SweepRepeat` and `.ConcurrentMultiPipeline` (tracked in #6518). The tests' own watchdog used a fixed 300 s wall-clock deadline that cannot distinguish a real dispatch deadlock from slow-but-progressing execution; under valgrind these solves legitimately run past 300 s and the watchdog aborted them via `std::abort()`. The watchdog now fires on absence of forward progress instead.

Root cause
The three robustness GTests wrap their body in a watchdog thread meant to turn a `ParallelizeArray` dispatch deadlock into a clean failure rather than a hung driver. A total-elapsed-time budget is the wrong signal: valgrind/TSan/Debug runs are slow but make steady progress, so the fixed 300 s deadline false-aborts them. CDash showed this deterministically every night: `SweepRepeat` ~304 s and `ConcurrentMultiPipeline` ~310–312 s against the 300 s deadline. This is the second time a fixed number was outgrown (an earlier change had already raised it 30 s → 300 s).

The fix
Drive the watchdog off a heartbeat counter that advances with solver progress, and abort only when the heartbeat does not advance for the stall window. An `itk::IterationEvent` observer attached to the filter in `run_one` bumps the heartbeat once per level-set iteration. A real deadlock freezes the heartbeat (no iterations complete) and still aborts promptly; valgrind/TSan/Debug runs keep it advancing regardless of total runtime. The per-iteration granularity is what makes `ConcurrentMultiPipeline` robust under valgrind's thread serialization, where no whole-pipeline `future` completes for minutes.

Validation
Built and ran locally before pushing (per local-test-first discipline).

- `ctest -R ParallelSparseFieldLevelSetRobustness`
- `ConcurrentMultiPipeline`
- `SweepRepeat`
- `pre-commit run --all-files`

An intermediate version that bumped the heartbeat only at whole-pipeline completion false-aborted `ConcurrentMultiPipeline` at 123 s under valgrind; the per-iteration observer fixes that and was re-validated above.
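
For readers unfamiliar with the pattern, the stall-based watchdog described under "The fix" can be illustrated without ITK. The following is a minimal sketch under stated assumptions, not the code in this PR: `RunWithStallWatchdog` is a hypothetical name, the work is any callable that bumps a shared counter, and a stall is reported through a return value instead of `std::abort()` so the example can terminate cleanly.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical helper, not the ITK test code. Runs `work` on a worker thread
// and returns true if the heartbeat failed to advance between two consecutive
// polls spaced one `stallWindow` apart. The real test aborts on a stall; this
// sketch only records it and then joins, so `work` must eventually return.
inline bool
RunWithStallWatchdog(const std::function<void(std::atomic<std::uint64_t> &)> & work,
                     std::chrono::milliseconds                                 stallWindow)
{
  std::atomic<std::uint64_t> heartbeat{ 0 };
  std::mutex                 mutex;
  std::condition_variable    cv;
  bool                       done = false;

  std::thread worker([&] {
    work(heartbeat);
    {
      std::lock_guard<std::mutex> guard(mutex);
      done = true;
    }
    cv.notify_one();
  });

  bool                         stalled = false;
  std::uint64_t                lastSeen = heartbeat.load();
  std::unique_lock<std::mutex> lock(mutex);
  while (!done)
  {
    // Wake early if the work finishes, otherwise after one stall window.
    if (cv.wait_for(lock, stallWindow, [&] { return done; }))
    {
      break;
    }
    const std::uint64_t now = heartbeat.load();
    if (now == lastSeen)
    {
      stalled = true; // no forward progress during a whole window
      break;
    }
    lastSeen = now;
  }
  lock.unlock();
  worker.join();
  return stalled;
}
```

Total runtime never enters the decision: only the comparison of consecutive heartbeat samples does. Because this sketch polls once per window, it reports a stall between one and two windows after the last beat; the timing of the actual test watchdog may differ.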