feat(ggml-cuda): MoE expert-streaming policy + mechanism + fused Q4 kernels #22
Draft
dusterbloom wants to merge 4 commits into
… kernels
Memory-constrained MoE expert streaming for GPUs that page experts from
host/unified memory over a slow link (e.g. RTX 3090 over PCIe Gen4 x4),
where the link — not compute — bounds MoE decode.
Policy (pure C++, CPU-tested):
- residency_planner (P3): decayed-frequency resident-set selection with
anti-thrash hysteresis.
- plan_prefetch (P1): demand-ordered cold-expert prefetch under a bandwidth cap.
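The two policy pieces above can be illustrated with a minimal CPU-only sketch. It is not the PR's residency_planner or plan_prefetch: the names `ResidencySketch` and `plan_prefetch_sketch`, the decay and margin defaults, and the fixed per-expert byte size are all assumptions made for illustration. Scores decay every step, a non-resident expert replaces the coldest resident only if it beats it by a margin (the anti-thrash hysteresis), and prefetch takes cold experts in demand order until a byte budget is spent.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <unordered_set>
#include <vector>

// Hypothetical sketch of decayed-frequency residency with hysteresis.
struct ResidencySketch {
    std::vector<double> score;         // decayed use frequency per expert
    std::unordered_set<int> resident;  // experts currently planned as resident
    std::size_t capacity;              // number of VRAM slots (assumed fixed)
    double decay;                      // per-step decay factor in (0, 1)
    double margin;                     // challenger must exceed margin * victim score

    ResidencySketch(int n_experts, std::size_t cap, double decay_ = 0.9, double margin_ = 1.25)
        : score(n_experts, 0.0), capacity(cap), decay(decay_), margin(margin_) {}

    // Decay every score, then credit the experts routed to in this step.
    void observe(const std::vector<int>& used) {
        for (double& s : score) s *= decay;
        for (int e : used) score[e] += 1.0;
    }

    // Fill free slots first; afterwards swap only when the hottest
    // non-resident beats the coldest resident by the hysteresis margin.
    void replan() {
        if (capacity == 0) return;
        std::vector<int> order(score.size());
        std::iota(order.begin(), order.end(), 0);
        std::stable_sort(order.begin(), order.end(),
                         [&](int a, int b) { return score[a] > score[b]; });
        for (int cand : order) {
            if (resident.count(cand) || score[cand] <= 0.0) continue;
            if (resident.size() < capacity) { resident.insert(cand); continue; }
            const int victim = *std::min_element(
                resident.begin(), resident.end(),
                [&](int a, int b) { return score[a] < score[b]; });
            if (score[cand] > margin * score[victim]) {
                resident.erase(victim);
                resident.insert(cand);
            } else {
                break;  // candidates are sorted, so no later one can win
            }
        }
    }
};

// Hypothetical sketch of demand-ordered prefetch under a bandwidth cap,
// expressed as a byte budget for one step. Resident experts are skipped.
inline std::vector<int> plan_prefetch_sketch(const std::vector<int>& demand_order,
                                             const std::unordered_set<int>& resident,
                                             std::size_t bytes_per_expert,
                                             std::size_t byte_budget) {
    std::vector<int> plan;
    std::size_t spent = 0;
    for (int e : demand_order) {
        if (resident.count(e)) continue;
        if (spent + bytes_per_expert > byte_budget) break;
        plan.push_back(e);
        spent += bytes_per_expert;
    }
    return plan;
}
```

With a margin of 2.0 and a decay of 0.5, an expert used once does not displace a resident whose decayed score is exactly half of its own, because the comparison is strict; lowering the margin to 1.25 lets the swap through.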
Mechanism (CUDA, sm_86-tested):
- expert_cache: VRAM slot pool, per-slot ready events, policy-aware eviction;
ensure_resident streams H2D on demand, prefetch overlaps on a copy stream so
the verify pass finds experts resident.
- expert-stream-hooks: per-layer registry + opt-in hot-path hooks (no-op until
a layer is registered, so the mul_mat_id splice cannot disturb the default path).
Fused MoE kernels (sm_86-tested vs CPU oracle):
- moe-fused Q4_0 (abs 2.2e-6) and Q4_K (rel 6.4e-6), gather+dequant+route in one
launch; Q4_K dequant mirrors ggml get_scale_min_k4 bit-for-bit.
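A CPU oracle for the Q4_0 case might look like the sketch below. It is an illustration, not the PR's test code. The nibble layout follows ggml's Q4_0 convention (32 values per block, low nibbles hold elements 0..15, high nibbles hold elements 16..31, each offset by 8), but the block struct stores the scale as a `float` rather than fp16 to keep the example short. Reading "route" as a weighted sum over the selected experts using router weights is also an assumption, as are the names `BlockQ4_0Sketch` and `moe_q4_0_reference`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK = 32;  // values per Q4_0 block

// Simplified Q4_0 block. ggml stores the scale as fp16; a float is used
// here (assumption) so the sketch needs no half-precision conversion.
struct BlockQ4_0Sketch {
    float   d;           // per-block scale
    uint8_t qs[QK / 2];  // two 4-bit quants per byte
};

// Dot product of one quantized weight row with a dense activation vector,
// dequantizing on the fly.
inline float dot_q4_0_row(const BlockQ4_0Sketch* row, const float* x, int n_blocks) {
    float acc = 0.0f;
    for (int b = 0; b < n_blocks; ++b) {
        for (int j = 0; j < QK / 2; ++j) {
            const int lo = (row[b].qs[j] & 0x0F) - 8;  // element j
            const int hi = (row[b].qs[j] >> 4) - 8;    // element j + 16
            acc += row[b].d * (lo * x[b * QK + j] + hi * x[b * QK + j + QK / 2]);
        }
    }
    return acc;
}

// Reference for one token: y[r] = sum_k weights[k] * (W[ids[k]] * x)[r].
// experts[e] holds rows * n_blocks blocks in row-major order.
inline std::vector<float> moe_q4_0_reference(
        const std::vector<std::vector<BlockQ4_0Sketch>>& experts,
        int rows, int n_blocks,
        const std::vector<int>& ids,
        const std::vector<float>& weights,
        const std::vector<float>& x) {
    std::vector<float> y(rows, 0.0f);
    for (std::size_t k = 0; k < ids.size(); ++k) {
        const BlockQ4_0Sketch* w = experts[ids[k]].data();  // gather
        for (int r = 0; r < rows; ++r) {
            y[r] += weights[k] *
                    dot_q4_0_row(w + static_cast<std::size_t>(r) * n_blocks, x.data(), n_blocks);
        }
    }
    return y;
}
```

A GPU kernel's output can then be compared element by element against this reference using an absolute or relative tolerance, which is how figures such as the abs and rel errors quoted above would typically be obtained.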
Tests: test-expert-stream 15/15, test-expert-cache 21/21, test-expert-hooks
10/10, test-moe-fused-q4{,k} PASS. Splice points in EXPERT-STREAM-INTEGRATION.md.
Call-site buffer wiring + nsys benchmark on 35B-A3B deferred.
The fused-MoE Q4_0/Q4_K test kernels used uint8_t via MSVC's transitive <cstdint>; native Linux nvcc (CUDA 12, gcc) needs it included explicitly. Caught by running the suite on the real Lucebox (lucebox3, RTX 3090). Suite stays 5/5 green on both Windows (MSVC) and Linux (gcc) sm_86.
Measured on lucebox3 (RTX 3090, PCIe x4 @ ~6.5 GB/s) with a new microbenchmark modeling one Qwen3.6-35B-A3B MoE layer (bench-expert-stream.cu).

Root cause (found by instrumenting H2D copy counts, not by assuming): naive prefetch evicted experts that were still in the current step's working set, then re-streamed them, issuing ~75% more H2D copies than on-demand (701 vs 402). Prefetch was therefore *slower*, even though copy/compute overlap worked fine on x4.

Fix: eviction now (a) uses decayed LFU usage instead of arbitrary id, and (b) never evicts an expert in the current step's working set (tracked via observe()). H2D copies drop to parity (443 == 443) and prefetch overlaps the draft pass as intended: +25-38% per layer-step (saves ~all the cold-expert streaming when the draft window covers it). Suite stays 5/5 green on the box.

Adds: residency_planner::usage(), expert_cache::h2d_copies telemetry, working-set protection, and the microbenchmark.
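The eviction fix can be illustrated with a host-only sketch that counts simulated H2D copies. It is not the PR's expert_cache: the struct name `ExpertCacheSketch`, its fields, and the decay default are hypothetical, and no CUDA calls are made. The point it shows is that a prefetch which is allowed to evict a member of the current step's working set forces that expert to be streamed again, while a protected cache refuses the prefetch instead.

```cpp
#include <cstddef>
#include <limits>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical slot cache that models H2D traffic with a counter.
struct ExpertCacheSketch {
    std::size_t n_slots;
    bool protect_working_set;
    std::unordered_map<int, double> usage;  // decayed use count (LFU signal)
    std::unordered_set<int> resident;       // experts currently holding a slot
    std::unordered_set<int> working_set;    // experts needed by the current step
    std::size_t h2d_copies = 0;             // simulated host-to-device copies

    ExpertCacheSketch(std::size_t slots, bool protect)
        : n_slots(slots), protect_working_set(protect) {}

    // Record the experts routed to in this step and decay older usage.
    void observe(const std::vector<int>& step_experts, double decay = 0.9) {
        for (auto& kv : usage) kv.second *= decay;
        working_set.clear();
        for (int e : step_experts) {
            usage[e] += 1.0;
            working_set.insert(e);
        }
    }

    // Make expert e resident, evicting the least-used eligible resident if
    // the pool is full. Returns false when no slot may be freed.
    bool ensure_resident(int e) {
        if (resident.count(e)) return true;
        if (resident.size() >= n_slots) {
            int victim = -1;
            double best = std::numeric_limits<double>::infinity();
            for (int r : resident) {
                if (protect_working_set && working_set.count(r)) continue;
                const auto it = usage.find(r);
                const double u = (it == usage.end()) ? 0.0 : it->second;
                if (u < best) { best = u; victim = r; }
            }
            if (victim < 0) return false;  // every resident is protected
            resident.erase(victim);
        }
        resident.insert(e);
        ++h2d_copies;
        return true;
    }
};
```

In a two-slot example where the step needs experts 0 and 1 and a prefetch asks for expert 2, the protected cache performs 2 copies and declines the prefetch. The unprotected cache performs 4, because the evicted working-set expert has to be copied back in.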
…e draft
Adds FINDINGS.md from measurements on the actual Lucebox (RTX 3090 sm_86, PCIe x4 ~6.5 GB/s, native dflash_server, Qwen3.6-35B-A3B):
- Two regimes: expert streaming is the consumer-VRAM bottleneck ONLY when the model doesn't fit. Comfortable residency (91% resident) -> x4 idle (~2%), autoregressive-bound; tight budget -> x4 saturates (3-5 GB/s), 1.9x slower.
- TQ3_0 KV cache costs ~11% vs Q4_0 at this config (free win).
- Prefetch: mechanism proven on the real x4 link (working-set-protected eviction -> H2D copy parity 443==443, +25-38%/layer-step when the model doesn't fit); draft->expert RECALL stays untested: the available DFlash drafter delivers only ~7% (no Domino aux heads + CUDA-graph churn), so speculation never hits its 2-3x. Mechanism-proven, recall-pending.
Scopes the PR honestly: the fused Q4_0/Q4_K MoE kernel and the working-set eviction fix (transferable to Spark's cache) stand regardless of the draft.