Track the dropout op added for plan task BPB.3a (beat-pytorch-baseline).
What
A general-purpose inverted-dropout op for ztensor with a deterministic, seedable mask, on both the CPU engine and the GB10 GPU (f32), passing all three quality gates.
Design
- Counter-based Philox4x32-10 RNG, keyed by (seed, linear element offset). Stateless and parallel-friendly; the same (seed, offset) gives the same draw on CPU (Go,
compute/philox.go) and GPU (CUDA, internal/cuda/kernels/dropout.cu), so masks are bit-identical -- which is what makes CPU-GPU parity pass. No cuRAND stateful generators.
- Recompute the mask in backward, never cache it. The mask is a pure function of (seed, offset, p); backward recomputes it deterministically. Capture-safe; nothing pinned across an arena reset (ADR 006).
- Inverted-dropout semantics matching
torch.nn.functional.dropout: training y = x*mask/(1-p), eval / p==0 exact identity. p scalar, validated in [0,1).
- Exposed as an optional capability interface (
compute.Dropouter[T]) + an optional gpuapi.Dropouter KernelRunner extension, so the core Engine interface is untouched and non-float32 / no-GPU paths report unavailability rather than stubbing.
Gates (all green)
- gradcheck:
Dropout OpInfo (p=0.3, fixed seed, [4,8]); deterministic mask => exact linear map, finite-diff == analytic backward.
- CPU-GPU parity (GB10):
parity PASS Dropout both schedules, fwd & bwd max_abs=0.000e+00 (bit-identical); dedicated TestGPUDropout_CPUParity / TestGPUDropout_Backward_CPUParity PASS on the GB10 via Spark.
- PyTorch oracle: SkipReason -- torch's training-mode mask uses its own Philox word->element mapping; matching it would mean reimplementing ztensor's Philox in the torch runner (the HadamardTransform precedent). Mask-vs-input math is pinned by gradcheck + parity instead; eval-mode identity is unit-tested.
PR: see linked.
Track the dropout op added for plan task BPB.3a (beat-pytorch-baseline).
What
A general-purpose inverted-dropout op for ztensor with a deterministic, seedable mask, on both the CPU engine and the GB10 GPU (f32), passing all three quality gates.
Design
compute/philox.go) and GPU (CUDA,internal/cuda/kernels/dropout.cu), so masks are bit-identical -- which is what makes CPU-GPU parity pass. No cuRAND stateful generators.torch.nn.functional.dropout: trainingy = x*mask/(1-p), eval /p==0exact identity.pscalar, validated in [0,1).compute.Dropouter[T]) + an optionalgpuapi.DropouterKernelRunner extension, so the coreEngineinterface is untouched and non-float32 / no-GPU paths report unavailability rather than stubbing.Gates (all green)
DropoutOpInfo (p=0.3, fixed seed, [4,8]); deterministic mask => exact linear map, finite-diff == analytic backward.parity PASS Dropoutboth schedules, fwd & bwdmax_abs=0.000e+00(bit-identical); dedicatedTestGPUDropout_CPUParity/TestGPUDropout_Backward_CPUParityPASS on the GB10 via Spark.PR: see linked.