Save-for-backward contract + arena pinning + poison-on-reset; gradcheck/parity harnesses; kernel numerics audit #128

Description

@dndungu

ztensor-side tracking for zerfoo docs/plan-gpu-training-hardening.md and ADR 006 (PR #127).

The bug class: nodes cache forward intermediates in struct fields and read them back in Backward, but the GPU arena can overwrite that memory before Backward runs. Three instances have shipped so far: zerfoo#842, zerfoo#845, and the Wolf QK-norm backward.

Work here:

1. SaveForBackward API + graph-owned lifetime.
2. ArenaPool Pin/Unpin honored by ResetPool/MarkStepBoundary/reuse.
3. ZTENSOR_ARENA_POISON=1 NaN-poison on reset.
4. gradcheck core + OpInfo registry.
5. GPU-vs-CPU parity harness with interleaved arena-stress schedules.
6. Kernels: drop global --use_fast_math (Makefile:7), fp32 fixed-order reduction accumulation, oracle-gate every kernel vs PyTorch (NGC container on the GB10).
7. ZTENSOR_DETERMINISTIC=1 mode.

Done = poison-mode full-suite green on GB10; oracle suite green without global fast-math; Wolf GB10 f32 fold clean (tracked in the zerfoo umbrella issue).
