Summary
Follow-up to #115. With v1.9.0 (stream-ordered overflow) the crossasset GB10
training still does not complete one step. Two distinct problems:
- The arena overflow is hit on the FIRST mini-batch sample's tiny alloc, even
with ZERFOO_ARENA_SIZE_GB=64. A 64 GB arena should not be exhausted by one
per-sample forward pass — this looks like an arena sizing/reset/accounting issue.
- The async overflow (cudaMallocAsync) itself still stalls on GB10 under memory
pressure, so v1.9.0's fix changed sync→async but the overflow path still
wedges.
Net: training freezes in the first ComputeGradients (step=0).
Pinned goroutine 1 (v1.9.0, ZERFOO_ARENA_SIZE_GB=64, GB10)
goroutine 1 [syscall]:
runtime.cgocall -> cuda._Cfunc_ccall_wrapper
-> cuda.MallocAsync(0xc000=48KB) runtime_purego.go:92 # async overflow
-> cuda.(*ArenaPool).Alloc arena.go:186 # overflow path (v1.9.0)
-> gpuapi.(*CUDAArenaPool).Alloc cuda_arena.go:59
-> compute.gpuUnaryOp gpu_kernels.go:603
-> compute.(*GPUEngine).gpuTanh
-> ... forward pass, first sample
D-state sibling thread: folio_wait_bit_common -> __folio_lock_or_retry (page
fault during the async alloc).
Why problem (1), the arena overflow, is the key puzzle
- The trainer uses mini-batches of batch=256 but processes one sample per
ComputeGradients (gradient accumulation), so the per-step working set is a
single sample's activations, which is small.
- Engine is CUDAArenaPool (not the MemPool fallback), and NewGPUEngine honors
ZERFOO_ARENA_SIZE_GB (gpu_engine.go:225). No "arena pool not available" warning.
So a 64 GB arena was created.
- Yet ArenaPool.Alloc reaches the overflow branch for a 48 KB gpuTanh output on
the first sample. So either the arena's usable capacity is far below 64 GB on
this path, or it is not being reset (StepScope/MarkStepBoundary) and a single
forward+backward genuinely fills 64 GB (unlikely for one sample), or there is
an offset/accounting bug.
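The hypotheses above can be told apart on a toy model. The sketch below is a generic bump allocator written for illustration only: the `arena` type, its fields, and `run` are hypothetical and are not ztensor's ArenaPool. It uses the 48 KiB request size from the trace (0xc000 bytes) against a 1 MiB toy capacity, and shows when the first overflow occurs with a reset per step, without any reset, and with a capacity that ended up as zero.

```go
package main

import "fmt"

// arena is a hypothetical bump allocator, not ztensor's ArenaPool.
type arena struct {
	capacity uint64 // usable bytes
	offset   uint64 // next free byte; invariant: offset <= capacity
}

// alloc bumps the offset by size and reports false when the request does
// not fit, which is where a real pool would take its overflow path.
func (a *arena) alloc(size uint64) bool {
	if size > a.capacity-a.offset {
		return false
	}
	a.offset += size
	return true
}

// reset rewinds the arena, as a step boundary would.
func (a *arena) reset() { a.offset = 0 }

// run performs steps of allocsPerStep allocations each and returns the
// index of the first step that overflowed, or -1 if none did.
func run(a *arena, steps, allocsPerStep int, size uint64, resetEachStep bool) int {
	for s := 0; s < steps; s++ {
		for i := 0; i < allocsPerStep; i++ {
			if !a.alloc(size) {
				return s
			}
		}
		if resetEachStep {
			a.reset()
		}
	}
	return -1
}

func main() {
	const size = 0xc000      // 48 KiB, the request size in the trace
	const capacity = 1 << 20 // 1 MiB toy capacity; 16 allocs use 768 KiB

	fmt.Println(run(&arena{capacity: capacity}, 100, 16, size, true))  // -1: reset each step, never overflows
	fmt.Println(run(&arena{capacity: capacity}, 100, 16, size, false)) // 1: no reset, overflows in the second step
	fmt.Println(run(&arena{capacity: 0}, 100, 16, size, true))         // 0: mis-sized, overflows on the first alloc
}
```

In this model only the mis-sized case overflows on the very first small allocation, which is the pattern the trace shows. That is suggestive rather than conclusive, since the real pool may differ from this model in ways that matter.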
Asks
- Add arena diagnostics (log at engine init + at first overflow): configured
capacity, current offset, hits/misses/reuses, alloc count, and whether
Reset/MarkStepBoundary is being driven by the training loop. This will show
immediately whether the arena is mis-sized, legitimately full, or never reset.
- Investigate why a single per-sample forward fills a 64 GB arena (or doesn't,
and the overflow is an accounting bug).
- Harden the async overflow on GB10: cudaMallocAsync under unified-memory
pressure still stalls here. Ideally overflow is never reached (proper
sizing/reset), but the path should not wedge if it is.
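As a starting point for the diagnostics ask, here is a rough sketch of what a snapshot and its one-line log could look like. Everything in it is an assumption for illustration: `ArenaStats`, its field names, and `Diagnose` do not exist in ztensor, and the classification rules are a first guess at separating the cases named above.

```go
package main

import "fmt"

const gib = uint64(1) << 30

// ArenaStats is a hypothetical snapshot; the field names are illustrative.
type ArenaStats struct {
	ConfiguredBytes      uint64 // size requested via ZERFOO_ARENA_SIZE_GB
	CapacityBytes        uint64 // size the arena actually holds
	OffsetBytes          uint64 // current bump offset
	Hits, Misses, Reuses uint64
	Allocs               uint64
	Resets               uint64 // Reset/MarkStepBoundary calls seen so far
}

// Diagnose classifies an overflow for a request of requestBytes.
func (s ArenaStats) Diagnose(requestBytes uint64) string {
	switch {
	case s.CapacityBytes < s.ConfiguredBytes:
		return "mis-sized: capacity below configured size"
	case s.OffsetBytes > s.CapacityBytes:
		return "accounting bug: offset beyond capacity"
	case requestBytes <= s.CapacityBytes-s.OffsetBytes:
		return "accounting bug: request fits but overflow was taken"
	case s.Resets == 0:
		return "full with no reset observed"
	default:
		return "full despite resets"
	}
}

// LogLine renders the snapshot as a single line for the first overflow.
func (s ArenaStats) LogLine(requestBytes uint64) string {
	return fmt.Sprintf(
		"arena overflow: configured=%d capacity=%d offset=%d hits=%d misses=%d reuses=%d allocs=%d resets=%d request=%d verdict=%q",
		s.ConfiguredBytes, s.CapacityBytes, s.OffsetBytes,
		s.Hits, s.Misses, s.Reuses, s.Allocs, s.Resets,
		requestBytes, s.Diagnose(requestBytes))
}

func main() {
	// Made-up numbers: a 64 GiB arena with 1 MiB used that still overflowed
	// on a 48 KiB request.
	s := ArenaStats{
		ConfiguredBytes: 64 * gib,
		CapacityBytes:   64 * gib,
		OffsetBytes:     1 << 20,
		Allocs:          12,
	}
	fmt.Println(s.LogLine(0xc000))
}
```

Logging both the configured and the actual capacity is deliberate: a mismatch between the two answers the sizing question without any further investigation.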
Env / repro
NVIDIA GB10 (sm_121), ztensor v1.9.0. Wolf train-crossasset -gpu, full COIN
1m bars, folds=2 epochs=1, batch=256, ZERFOO_ARENA_SIZE_GB=64, mem 96Gi. Hangs in
the first step's forward pass. Capture rig + dumps:
wolf/scripts/t82-stall-watchdog.sh, wolf/.claude/scratch/t82-dump/.
Cross-refs
#115 (async overflow — necessary, still insufficient), #111, #106. Wolf devlog
2026-06-07 has the full chain across v1.8.1 -> v1.8.2 -> v1.9.0.