ztensor-side tracking for zerfoo docs/plan-gpu-training-hardening.md and ADR 006 (PR #127).
The bug class: nodes cache forward intermediates in struct fields and read them in Backward, but the GPU arena reuses and overwrites that memory before Backward runs (zerfoo#842, zerfoo#845, Wolf QK-norm backward -- three shipped instances).
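A minimal CPU-only sketch of this bug class, for illustration. The `arena` and `squareNode` types are hypothetical stand-ins, not ztensor or zerfoo code; the real arena manages GPU memory, but the aliasing hazard has the same shape.

```go
package main

import "fmt"

// arena is a toy bump allocator standing in for the GPU arena (hypothetical).
type arena struct {
	buf []float32
	off int
}

// alloc hands out the next n elements of the backing buffer.
func (a *arena) alloc(n int) []float32 {
	s := a.buf[a.off : a.off+n : a.off+n]
	a.off += n
	return s
}

// reset makes the whole buffer available again without clearing it.
func (a *arena) reset() { a.off = 0 }

// squareNode computes y = x^2 and caches its input in a struct field.
type squareNode struct{ x []float32 }

// Forward stores a reference to arena memory, not a copy.
func (n *squareNode) Forward(x []float32) { n.x = x }

// Backward computes dy/dx = 2x from whatever the cached slice holds now.
func (n *squareNode) Backward() []float32 {
	g := make([]float32, len(n.x))
	for i, v := range n.x {
		g[i] = 2 * v
	}
	return g
}

// staleGradient resets and reuses the arena between Forward and Backward.
func staleGradient() []float32 {
	a := &arena{buf: make([]float32, 4)}
	x := a.alloc(2)
	x[0], x[1] = 3, 4

	n := &squareNode{}
	n.Forward(x)

	a.reset()
	y := a.alloc(2) // aliases the memory that n.x still points at
	y[0], y[1] = 100, 200

	return n.Backward()
}

func main() {
	fmt.Println(staleGradient()) // prints [200 400]; the correct gradient is [6 8]
}
```

The result is finite and plausible-looking, which is why this class of bug ships silently.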
Work here:
(1) SaveForBackward API + graph-owned lifetime;
(2) ArenaPool Pin/Unpin honored by ResetPool/MarkStepBoundary/reuse;
(3) ZTENSOR_ARENA_POISON=1 NaN-poison on reset;
(4) gradcheck core + OpInfo registry;
(5) GPU-vs-CPU parity harness with interleaved arena-stress schedules;
(6) kernels: drop global --use_fast_math (Makefile:7), fp32 fixed-order reduction accumulation, oracle-gate every kernel vs PyTorch (NGC container on the GB10);
(7) ZTENSOR_DETERMINISTIC=1 mode.
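A rough sketch of how items (2) and (3) could interact, under stated assumptions. The `pool` and `block` types and their methods are hypothetical and do not reflect the actual ArenaPool API; the `poison` field stands in for ZTENSOR_ARENA_POISON=1, which real code would presumably read from the environment.

```go
package main

import (
	"fmt"
	"math"
)

// block is one pooled buffer (hypothetical).
type block struct {
	data   []float32
	pinned bool
	inUse  bool
}

// pool is a toy stand-in for an arena pool with pinning and poison-on-reset.
type pool struct {
	blocks []*block
	poison bool // assumption: mirrors ZTENSOR_ARENA_POISON=1
}

// Alloc reuses a free, unpinned block of matching size or creates a new one.
func (p *pool) Alloc(n int) *block {
	for _, b := range p.blocks {
		if !b.inUse && !b.pinned && len(b.data) == n {
			b.inUse = true
			return b
		}
	}
	b := &block{data: make([]float32, n), inUse: true}
	p.blocks = append(p.blocks, b)
	return b
}

// Pin marks a block as saved for backward; Reset must leave it alone.
func (p *pool) Pin(b *block) { b.pinned = true }

// Unpin makes the block eligible for release at the next Reset.
func (p *pool) Unpin(b *block) { b.pinned = false }

// Reset releases every unpinned block and, in poison mode, fills it with NaN
// so that any stale read propagates visibly instead of yielding plausible values.
func (p *pool) Reset() {
	nan := float32(math.NaN())
	for _, b := range p.blocks {
		if b.pinned {
			continue
		}
		b.inUse = false
		if p.poison {
			for i := range b.data {
				b.data[i] = nan
			}
		}
	}
}

// demo pins one block, leaves another unpinned, and resets the pool.
func demo() (saved float32, scratchPoisoned bool) {
	p := &pool{poison: true}

	s := p.Alloc(1) // forward intermediate needed in Backward
	s.data[0] = 3
	p.Pin(s)

	t := p.Alloc(1) // scratch buffer, not saved
	t.data[0] = 7

	p.Reset()

	saved = s.data[0]
	scratchPoisoned = math.IsNaN(float64(t.data[0]))
	p.Unpin(s) // done with backward; released on the next Reset
	return saved, scratchPoisoned
}

func main() {
	fmt.Println(demo())
}
```

In this sketch the pinned value survives the reset while the unpinned buffer turns into NaN, so a node that reads an unsaved intermediate fails loudly under poison mode.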
Done = poison-mode full-suite green on GB10; oracle suite green without global fast-math; Wolf GB10 f32 fold clean (tracked in the zerfoo umbrella issue).