Dropout op (CPU + GB10 GPU) with deterministic Philox mask #168

Description

@dndungu

Track the dropout op added for plan task BPB.3a (beat-pytorch-baseline).

What

A general-purpose inverted-dropout op for ztensor with a deterministic, seedable mask, on both the CPU engine and the GB10 GPU (f32), passing all three quality gates.

Design

  • Counter-based Philox4x32-10 RNG, keyed by (seed, linear element offset). It is stateless and parallel-friendly: the same (seed, offset) gives the same draw on CPU (Go, compute/philox.go) and GPU (CUDA, internal/cuda/kernels/dropout.cu), so the masks are bit-identical, which is what makes CPU-GPU parity pass. No stateful cuRAND generators are used.
  • Recompute the mask in backward, never cache it. The mask is a pure function of (seed, offset, p), so backward recomputes it deterministically. This keeps the op capture-safe, and nothing is pinned across an arena reset (ADR 006).
  • Inverted-dropout semantics matching torch.nn.functional.dropout: in training, y = x*mask/(1-p); in eval mode or when p == 0, the op is an exact identity. p is a scalar, validated to lie in [0,1).
  • Exposed as an optional capability interface (compute.Dropouter[T]) + an optional gpuapi.Dropouter KernelRunner extension, so the core Engine interface is untouched and non-float32 / no-GPU paths report unavailability rather than stubbing.
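
For readers who want the design in concrete form, here is a minimal CPU-only sketch in Go. It is not the code in compute/philox.go. The names Dropout, DropoutBackward and keep are hypothetical, and so is the way (seed, offset) is packed into the Philox key and counter (seed in the key, offset in the low counter words, first output word used). The repository's actual word-to-element mapping may differ. One Philox call per element is also a simplification chosen for readability.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Philox4x32 multiplier and key-schedule (Weyl) constants.
const (
	philoxM0 uint32 = 0xD2511F53
	philoxM1 uint32 = 0xCD9E8D57
	philoxW0 uint32 = 0x9E3779B9
	philoxW1 uint32 = 0xBB67AE85
)

// philox4x32_10 runs ten Philox rounds over a 128-bit counter with a 64-bit key.
// The key is bumped before every round except the first.
func philox4x32_10(ctr [4]uint32, key [2]uint32) [4]uint32 {
	for round := 0; round < 10; round++ {
		if round > 0 {
			key[0] += philoxW0
			key[1] += philoxW1
		}
		hi0, lo0 := bits.Mul32(philoxM0, ctr[0])
		hi1, lo1 := bits.Mul32(philoxM1, ctr[2])
		ctr = [4]uint32{hi1 ^ ctr[1] ^ key[0], lo1, hi0 ^ ctr[3] ^ key[1], lo0}
	}
	return ctr
}

// keep reports whether the element at the given linear offset survives dropout.
// Assumed layout (hypothetical): seed -> key, offset -> counter words 0 and 1.
func keep(seed, offset uint64, p float32) bool {
	key := [2]uint32{uint32(seed), uint32(seed >> 32)}
	ctr := [4]uint32{uint32(offset), uint32(offset >> 32), 0, 0}
	out := philox4x32_10(ctr, key)
	u := float32(out[0]>>8) * (1.0 / (1 << 24)) // top 24 bits -> uniform in [0, 1)
	return u >= p
}

// Dropout applies inverted dropout: y = x*mask/(1-p) in training,
// and an exact copy of x in eval mode or when p == 0.
func Dropout(x []float32, p float32, seed uint64, training bool) ([]float32, error) {
	if !(p >= 0 && p < 1) {
		return nil, fmt.Errorf("dropout: p must be in [0,1), got %v", p)
	}
	y := make([]float32, len(x))
	if !training || p == 0 {
		copy(y, x)
		return y, nil
	}
	scale := 1 / (1 - p)
	for i, v := range x {
		if keep(seed, uint64(i), p) {
			y[i] = v * scale
		}
	}
	return y, nil
}

// DropoutBackward recomputes the mask from (seed, offset, p) instead of
// reading a cached copy, so dx = dy*mask/(1-p) uses the same mask as forward.
func DropoutBackward(dy []float32, p float32, seed uint64) ([]float32, error) {
	return Dropout(dy, p, seed, true)
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	y, _ := Dropout(x, 0.3, 42, true)
	dx, _ := DropoutBackward([]float32{1, 1, 1, 1, 1, 1, 1, 1}, 0.3, 42)
	fmt.Println("y: ", y)
	fmt.Println("dx:", dx) // zero exactly where y is zero
}
```

Because keep depends only on its arguments, any worker (a goroutine or a CUDA thread) can evaluate any element in any order and still get the same mask, and backward needs no state from forward beyond the seed and p.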

Gates (all green)

  • gradcheck: Dropout OpInfo (p=0.3, fixed seed, [4,8]); deterministic mask => exact linear map, finite-diff == analytic backward.
  • CPU-GPU parity (GB10): parity PASS Dropout both schedules, fwd & bwd max_abs=0.000e+00 (bit-identical); dedicated TestGPUDropout_CPUParity / TestGPUDropout_Backward_CPUParity PASS on the GB10 via Spark.
  • PyTorch oracle: SkipReason -- torch's training-mode mask uses its own Philox word->element mapping; matching it would mean reimplementing ztensor's Philox in the torch runner (the HadamardTransform precedent). Mask-vs-input math is pinned by gradcheck + parity instead; eval-mode identity is unit-tested.

PR: see linked.
