v1.9.0: arena overflow hit on FIRST per-sample alloc (64GB arena) + async overflow still stalls GB10 — crossasset training freezes step 0 #118

Description

@dndungu

Summary

Follow-up to #115. With v1.9.0 (stream-ordered overflow) the crossasset GB10
training still does not complete one step. Two distinct problems:

  1. The arena overflow is hit on the FIRST mini-batch sample's tiny alloc, even
    with ZERFOO_ARENA_SIZE_GB=64. A 64 GB arena should not be exhausted by one
    per-sample forward pass — this looks like an arena sizing/reset/accounting issue.
  2. The async overflow (cudaMallocAsync) itself still stalls on GB10 under
    memory pressure — so v1.9.0's fix changed sync→async but the overflow path still
    wedges.

Net: training freezes in the first ComputeGradients (step=0).

Pinned goroutine 1 (v1.9.0, ZERFOO_ARENA_SIZE_GB=64, GB10)

goroutine 1 [syscall]:
runtime.cgocall -> cuda._Cfunc_ccall_wrapper
 -> cuda.MallocAsync(0xc000=48KB)        runtime_purego.go:92   # async overflow
 -> cuda.(*ArenaPool).Alloc              arena.go:186           # overflow path (v1.9.0)
 -> gpuapi.(*CUDAArenaPool).Alloc        cuda_arena.go:59
 -> compute.gpuUnaryOp                   gpu_kernels.go:603
 -> compute.(*GPUEngine).gpuTanh
 -> ... forward pass, first sample

D-state sibling thread: folio_wait_bit_common -> __folio_lock_or_retry (page
fault during the async alloc).

Why (1) is the key puzzle

  • Trainer mini-batches: batch=256, but processes one sample per
    ComputeGradients (gradient accumulation), so the per-step working set is a
    single sample's activations (small).
  • Engine is CUDAArenaPool (not the MemPool fallback), and NewGPUEngine honors
    ZERFOO_ARENA_SIZE_GB (gpu_engine.go:225). No "arena pool not available" warning.
    So a 64 GB arena was created.
  • Yet ArenaPool.Alloc reaches the overflow branch for a 48 KB gpuTanh
    output on the first sample. So either the arena's usable capacity is far below
    64 GB on this path, or it is not being reset (StepScope/MarkStepBoundary) and a
    single forward+backward genuinely fills 64 GB (unlikely for one sample), or there
    is an offset/accounting bug.

Asks

  1. Add arena diagnostics (log at engine init + at first overflow): configured
    capacity, current offset, hits/misses/reuses, alloc count, and whether
    Reset/MarkStepBoundary is being driven by the training loop. This will say
    immediately whether the arena is mis-sized vs. legitimately full vs. never reset.
  2. Investigate why a single per-sample forward fills a 64 GB arena (or doesn't,
    and the overflow is an accounting bug).
  3. Harden the async overflow on GB10: cudaMallocAsync under unified-memory
    pressure still stalls here. Ideally overflow is never reached (proper
    sizing/reset), but the path should not wedge if it is.
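
A rough sketch of what the diagnostics in ask 1 could look like. Everything here is hypothetical: `ArenaStats`, its field names, and `Diagnose` are illustrative and are not existing ztensor API. The classification is a coarse heuristic that maps a snapshot taken at the first overflow onto the cases above (mis-sized, accounting bug, never reset, legitimately full).

```go
package main

import "fmt"

// ArenaStats is a hypothetical snapshot of arena counters, captured at
// engine init and again at the first overflow.
type ArenaStats struct {
	CapacityBytes int64 // capacity the arena actually has
	OffsetBytes   int64 // current bump offset
	Hits          int64
	Misses        int64
	Reuses        int64
	AllocCount    int64
	Resets        int64 // times Reset/MarkStepBoundary was driven
}

// String renders the snapshot as a single log line.
func (s ArenaStats) String() string {
	return fmt.Sprintf(
		"arena cap=%d off=%d hits=%d misses=%d reuses=%d allocs=%d resets=%d",
		s.CapacityBytes, s.OffsetBytes, s.Hits, s.Misses, s.Reuses,
		s.AllocCount, s.Resets)
}

// Diagnose classifies an overflow for a request of reqBytes, given the
// capacity the user configured (e.g. via ZERFOO_ARENA_SIZE_GB).
func (s ArenaStats) Diagnose(reqBytes, configuredBytes int64) string {
	switch {
	case s.CapacityBytes < configuredBytes:
		return "mis-sized" // arena is smaller than what was requested
	case s.OffsetBytes+reqBytes <= s.CapacityBytes:
		return "accounting-bug" // request fits, yet overflow was taken
	case s.Resets == 0:
		return "never-reset" // full, and no step boundary was ever seen
	default:
		return "legitimately-full"
	}
}

func main() {
	const gib = int64(1) << 30
	// Example snapshot with made-up values, for illustration only.
	s := ArenaStats{CapacityBytes: 64 * gib, OffsetBytes: 1 << 20, AllocCount: 10}
	fmt.Println(s)
	fmt.Println(s.Diagnose(48<<10, 64*gib))
}
```

Note that "never-reset" is ambiguous on step 0, where zero resets is expected; pairing it with AllocCount and the offset should make it clear whether anything plausibly filled the arena before the first forward pass.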

Env / repro

NVIDIA GB10 (sm_121), ztensor v1.9.0. Wolf train-crossasset -gpu, full COIN
1m bars, folds=2 epochs=1, batch=256, ZERFOO_ARENA_SIZE_GB=64, mem 96Gi. Hangs in
the first step's forward pass. Capture rig + dumps:
wolf/scripts/t82-stall-watchdog.sh, wolf/.claude/scratch/t82-dump/.

Cross-refs

#115 (async overflow — necessary, still insufficient), #111, #106. Wolf devlog
2026-06-07 has the full chain across v1.8.1 -> v1.8.2 -> v1.9.0.
