Skip to content

feat(ggml-cuda): MoE expert-streaming policy + mechanism + fused Q4 kernels#22

Draft
dusterbloom wants to merge 4 commits into
Luce-Org:luce-dflashfrom
dusterbloom:feat/moe-expert-stream
Draft

feat(ggml-cuda): MoE expert-streaming policy + mechanism + fused Q4 kernels#22
dusterbloom wants to merge 4 commits into
Luce-Org:luce-dflashfrom
dusterbloom:feat/moe-expert-stream

Conversation

@dusterbloom

Copy link
Copy Markdown

… kernels

Memory-constrained MoE expert streaming for GPUs that page experts from host/unified memory over a slow link (e.g. RTX 3090 over PCIe Gen4 x4), where the link — not compute — bounds MoE decode.

Policy (pure C++, CPU-tested):

  • residency_planner (P3): decayed-frequency resident-set selection with anti-thrash hysteresis.
  • plan_prefetch (P1): demand-ordered cold-expert prefetch under a bandwidth cap.

Mechanism (CUDA, sm_86-tested):

  • expert_cache: VRAM slot pool, per-slot ready events, policy-aware eviction; ensure_resident streams H2D on demand, prefetch overlaps on a copy stream so the verify pass finds experts resident.
  • expert-stream-hooks: per-layer registry + opt-in hot-path hooks (no-op until a layer is registered, so the mul_mat_id splice cannot disturb the default path).

Fused MoE kernels (sm_86-tested vs CPU oracle):

  • moe-fused Q4_0 (abs 2.2e-6) and Q4_K (rel 6.4e-6), gather+dequant+route in one launch; Q4_K dequant mirrors ggml get_scale_min_k4 bit-for-bit.

Tests: test-expert-stream 15/15, test-expert-cache 21/21, test-expert-hooks 10/10, test-moe-fused-q4{,k} PASS. Splice points in EXPERT-STREAM-INTEGRATION.md. Call-site buffer wiring + nsys benchmark on 35B-A3B deferred.

Overview

Additional information

Requirements

… kernels

Memory-constrained MoE expert streaming for GPUs that page experts from
host/unified memory over a slow link (e.g. RTX 3090 over PCIe Gen4 x4),
where the link — not compute — bounds MoE decode.

Policy (pure C++, CPU-tested):
- residency_planner (P3): decayed-frequency resident-set selection with
  anti-thrash hysteresis.
- plan_prefetch (P1): demand-ordered cold-expert prefetch under a bandwidth cap.

Mechanism (CUDA, sm_86-tested):
- expert_cache: VRAM slot pool, per-slot ready events, policy-aware eviction;
  ensure_resident streams H2D on demand, prefetch overlaps on a copy stream so
  the verify pass finds experts resident.
- expert-stream-hooks: per-layer registry + opt-in hot-path hooks (no-op until
  a layer is registered, so the mul_mat_id splice cannot disturb the default path).

Fused MoE kernels (sm_86-tested vs CPU oracle):
- moe-fused Q4_0 (abs 2.2e-6) and Q4_K (rel 6.4e-6), gather+dequant+route in one
  launch; Q4_K dequant mirrors ggml get_scale_min_k4 bit-for-bit.

Tests: test-expert-stream 15/15, test-expert-cache 21/21, test-expert-hooks
10/10, test-moe-fused-q4{,k} PASS. Splice points in EXPERT-STREAM-INTEGRATION.md.
Call-site buffer wiring + nsys benchmark on 35B-A3B deferred.
@github-actions github-actions Bot added documentation Improvements or additions to documentation ggml CUDA testing labels Jun 26, 2026
The fused-MoE Q4_0/Q4_K test kernels used uint8_t via MSVC's transitive
<cstdint>; native Linux nvcc (CUDA 12, gcc) needs it included explicitly.
Caught by running the suite on the real Lucebox (lucebox3, RTX 3090).
Suite stays 5/5 green on both Windows (MSVC) and Linux (gcc) sm_86.
Measured on lucebox3 (RTX 3090, PCIe x4 @ ~6.5 GB/s) with a new microbenchmark
modeling one Qwen3.6-35B-A3B MoE layer (bench-expert-stream.cu).

Root cause (found by instrumenting H2D copy counts, not by assuming): naive
prefetch evicted experts that were still in the current step's working set, then
re-streamed them — issuing ~75% more H2D than on-demand (701 vs 402 copies), so
prefetch was *slower* despite copy/compute overlap working fine on x4.

Fix: eviction now (a) uses decayed LFU usage instead of arbitrary id, and
(b) never evicts an expert in the current step's working set (tracked via
observe()). H2D copies drop to parity (443 == 443) and prefetch overlaps the
draft pass as intended: +25-38% per layer-step (saves ~all the cold-expert
streaming when the draft window covers it). Suite stays 5/5 green on the box.

Adds: residency_planner::usage(), expert_cache::h2d_copies telemetry,
working-set protection, and the microbenchmark.
…e draft

Adds FINDINGS.md from measurements on the actual Lucebox (RTX 3090 sm_86, PCIe
x4 ~6.5 GB/s, native dflash_server, Qwen3.6-35B-A3B):

- Two regimes: expert streaming is the consumer-VRAM bottleneck ONLY when the
  model doesn't fit. Comfortable residency (91% resident) -> x4 idle (~2%),
  autoregressive-bound; tight budget -> x4 saturates (3-5 GB/s), 1.9x slower.
- TQ3_0 KV cache costs ~11% vs Q4_0 at this config (free win).
- Prefetch: mechanism proven on the real x4 link (working-set-protected eviction
  -> H2D copy parity 443==443, +25-38%/layer-step when the model doesn't fit);
  draft->expert RECALL stays untested — the available DFlash drafter delivers
  only ~7% (no Domino aux heads + CUDA-graph churn), so speculation never hits
  its 2-3x. Mechanism-proven, recall-pending.

Scopes the PR honestly: the fused Q4_0/Q4_K MoE kernel and the working-set
eviction fix (transferable to Spark's cache) stand regardless of the draft.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CUDA documentation Improvements or additions to documentation ggml testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant