v1.9.0: arena overflow hit on FIRST per-sample alloc (64GB arena) + async overflow still stalls GB10 — crossasset training freezes step 0

## Summary

Follow-up to #115. With **v1.9.0** (stream-ordered overflow) the crossasset GB10
training **still does not complete one step**. Two distinct problems:

1. **The arena overflow is hit on the FIRST mini-batch sample's tiny alloc**, even
   with `ZERFOO_ARENA_SIZE_GB=64`. A 64 GB arena should not be exhausted by one
   per-sample forward pass — this looks like an arena sizing/reset/accounting issue.
2. **The async overflow (`cudaMallocAsync`) itself still stalls on GB10** under
   memory pressure. v1.9.0 switched the overflow allocation from sync to async,
   but the overflow path still wedges.

Net: training freezes in the first `ComputeGradients` (step=0).

## Pinned goroutine 1 (v1.9.0, ZERFOO_ARENA_SIZE_GB=64, GB10)

```
goroutine 1 [syscall]:
runtime.cgocall -> cuda._Cfunc_ccall_wrapper
 -> cuda.MallocAsync(0xc000=48KB)        runtime_purego.go:92   # async overflow
 -> cuda.(*ArenaPool).Alloc              arena.go:186           # overflow path (v1.9.0)
 -> gpuapi.(*CUDAArenaPool).Alloc        cuda_arena.go:59
 -> compute.gpuUnaryOp                   gpu_kernels.go:603
 -> compute.(*GPUEngine).gpuTanh
 -> ... forward pass, first sample
```
A sibling thread sits in D state (uninterruptible sleep) at
`folio_wait_bit_common -> __folio_lock_or_retry`, i.e. it is blocked on a page
fault during the async alloc.

## Why (1) is the key puzzle

- The trainer uses mini-batches of `batch=256` but processes **one sample per
  `ComputeGradients`** (gradient accumulation), so the per-step working set is a
  single sample's activations, which is small.
- Engine is `CUDAArenaPool` (not the MemPool fallback), and `NewGPUEngine` honors
  `ZERFOO_ARENA_SIZE_GB` (gpu_engine.go:225). No "arena pool not available" warning.
  So a 64 GB arena was created.
- Yet `ArenaPool.Alloc` reaches the **overflow** branch for a 48 KB `gpuTanh`
  output on the first sample. So either the arena's usable capacity is far below
  64 GB on this path, or it is not being reset (StepScope/MarkStepBoundary) and a
  single forward+backward genuinely fills 64 GB (unlikely for one sample), or there
  is an offset/accounting bug.
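
For scale, a back-of-the-envelope check of how many 48 KB allocations a 64 GB arena should hold. This assumes `ZERFOO_ARENA_SIZE_GB` is in binary units (GiB) and a plain bump allocator with no alignment padding. Both are assumptions; `allocsToFill` is an illustrative helper, not ztensor code.

```go
package main

import "fmt"

const (
	arenaBytes = 64 << 30 // 64 GiB, assuming the env var is in binary units
	allocBytes = 0xc000   // 48 KiB, the request size seen in the trace
)

// allocsToFill returns how many allocations of size alloc fit into capacity,
// ignoring alignment padding and any per-allocation header (an assumption).
func allocsToFill(capacity, alloc uint64) uint64 {
	return capacity / alloc
}

func main() {
	fmt.Println(allocsToFill(arenaBytes, allocBytes)) // 1398101
}
```

Roughly 1.4 million allocations of this size would be needed before overflow, which is why a legitimately full arena on the first sample seems unlikely.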

## Asks

1. **Add arena diagnostics** (log at engine init + at first overflow): configured
   capacity, current `offset`, `hits/misses/reuses`, alloc count, and whether
   `Reset`/`MarkStepBoundary` is being driven by the training loop. This will say
   immediately whether the arena is mis-sized vs. legitimately full vs. never reset.
2. **Investigate why a single per-sample forward fills a 64 GB arena** (or doesn't,
   and the overflow is an accounting bug).
3. **Harden the async overflow on GB10** — `cudaMallocAsync` under unified-memory
   pressure still stalls here; ideally overflow is never reached (proper
   sizing/reset), but the path should not wedge if it is.
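
A minimal sketch of the diagnostics requested in ask 1, using a toy bump allocator that tracks offsets only. `diagArena`, `ArenaStats`, and their fields are hypothetical names chosen for illustration. The real `ArenaPool` internals are not shown in this issue, so the actual fields and the hit/miss/reuse counters will differ.

```go
package main

import (
	"fmt"
	"os"
)

// ArenaStats is a hypothetical snapshot of the counters worth logging.
type ArenaStats struct {
	Capacity, Offset          uint64
	Allocs, Overflows, Resets uint64
}

func (s ArenaStats) String() string {
	return fmt.Sprintf("capacity=%d offset=%d allocs=%d overflows=%d resets=%d",
		s.Capacity, s.Offset, s.Allocs, s.Overflows, s.Resets)
}

// diagArena is a toy bump allocator: it hands out offsets, not real memory.
type diagArena struct {
	stats          ArenaStats
	loggedOverflow bool
}

func newDiagArena(capacity uint64) *diagArena {
	// Log the configured capacity once at init.
	fmt.Fprintf(os.Stderr, "arena init: capacity=%d bytes\n", capacity)
	return &diagArena{stats: ArenaStats{Capacity: capacity}}
}

// Alloc returns the offset of the new block, or false when the request does
// not fit and the caller would have to take the overflow path.
func (a *diagArena) Alloc(size uint64) (uint64, bool) {
	a.stats.Allocs++
	if size > a.stats.Capacity-a.stats.Offset {
		a.stats.Overflows++
		if !a.loggedOverflow { // log full state on the first overflow only
			a.loggedOverflow = true
			fmt.Fprintf(os.Stderr, "arena first overflow: request=%d %s\n", size, a.stats)
		}
		return 0, false
	}
	off := a.stats.Offset
	a.stats.Offset += size
	return off, true
}

// Reset rewinds the arena; Resets == 0 at overflow means the loop never reset it.
func (a *diagArena) Reset() {
	a.stats.Offset = 0
	a.stats.Resets++
}

func main() {
	a := newDiagArena(100 << 10) // 100 KiB toy arena
	for i := 0; i < 3; i++ {
		off, ok := a.Alloc(0xc000) // third 48 KiB request does not fit
		fmt.Println(off, ok)
	}
	a.Reset()
	fmt.Println(a.stats)
}
```

At the first overflow, a small `offset` relative to `capacity` would point to mis-sizing or an accounting bug, while a large `offset` with `resets=0` would point to a missing reset.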

## Env / repro

NVIDIA GB10 (sm_121), ztensor **v1.9.0**. Wolf train-crossasset `-gpu`, full COIN
1m bars, folds=2 epochs=1, batch=256, `ZERFOO_ARENA_SIZE_GB=64`, mem 96Gi. Hangs in
the first step's forward pass. Capture rig + dumps:
`wolf/scripts/t82-stall-watchdog.sh`, `wolf/.claude/scratch/t82-dump/`.

## Cross-refs

#115 (async overflow — necessary, still insufficient), #111, #106. Wolf devlog
2026-06-07 has the full chain across v1.8.1 -> v1.8.2 -> v1.9.0.

