feat(ggml-cuda): MoE expert-streaming policy + mechanism + fused Q4 kernels by dusterbloom · Pull Request #22 · Luce-Org/llama.cpp-dflash-ggml

dusterbloom · 2026-06-26T11:08:03Z

… kernels

Memory-constrained MoE expert streaming for GPUs that page experts from host/unified memory over a slow link (e.g. RTX 3090 over PCIe Gen4 x4), where the link — not compute — bounds MoE decode.

Policy (pure C++, CPU-tested):

residency_planner (P3): decayed-frequency resident-set selection with anti-thrash hysteresis.
plan_prefetch (P1): demand-ordered cold-expert prefetch under a bandwidth cap.
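
To make the two policy pieces concrete, here is a minimal CPU-only sketch under stated assumptions. `ResidencySketch`, `plan_prefetch_sketch`, the `decay` and `margin` parameters, and the uniform `expert_bytes` are hypothetical names chosen for illustration; they are not the PR's `residency_planner` or `plan_prefetch` API. The sketch assumes hysteresis means "a challenger must beat the weakest resident by a factor before it displaces it" and that the bandwidth cap can be expressed as a byte budget per prefetch window.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of decayed-frequency residency with hysteresis.
struct ResidencySketch {
    std::vector<float> score;     // decayed use count per expert
    std::vector<bool>  resident;  // current resident set
    float decay;                  // per-step multiplier in (0, 1)
    float margin;                 // challenger must exceed margin * weakest resident

    ResidencySketch(std::size_t n_experts, float decay_, float margin_)
        : score(n_experts, 0.0f), resident(n_experts, false),
          decay(decay_), margin(margin_) {
        assert(decay > 0.0f && decay < 1.0f && margin >= 1.0f);
    }

    // Record the experts routed to in one decode step.
    void observe(const std::vector<int>& used) {
        for (float& s : score) s *= decay;
        for (int e : used) score[e] += 1.0f;
    }

    int best_candidate() const {   // highest-scoring non-resident, or -1
        int best = -1;
        for (std::size_t i = 0; i < score.size(); ++i)
            if (!resident[i] && (best < 0 || score[i] > score[best])) best = (int) i;
        return best;
    }

    int weakest_resident() const { // lowest-scoring resident, or -1
        int worst = -1;
        for (std::size_t i = 0; i < score.size(); ++i)
            if (resident[i] && (worst < 0 || score[i] < score[worst])) worst = (int) i;
        return worst;
    }

    // Re-plan for `capacity` slots (growing or steady capacity only).
    void plan(std::size_t capacity) {
        std::size_t n = (std::size_t) std::count(resident.begin(), resident.end(), true);
        while (n < capacity) {     // free slots are filled without hysteresis
            int c = best_candidate();
            if (c < 0) break;
            resident[c] = true;
            ++n;
        }
        for (;;) {                 // swaps must clear the hysteresis margin
            int c = best_candidate();
            int w = weakest_resident();
            if (c < 0 || w < 0 || score[c] <= margin * score[w]) break;
            resident[w] = false;
            resident[c] = true;
        }
    }
};

// Hypothetical demand-ordered prefetch: take cold experts in demand order
// until the byte budget (bandwidth cap x window) is exhausted.
std::vector<int> plan_prefetch_sketch(const std::vector<int>& demand_order,
                                      const std::vector<bool>& resident,
                                      std::size_t expert_bytes,
                                      std::size_t budget_bytes) {
    std::vector<int> out;
    std::size_t spent = 0;
    for (int e : demand_order) {
        if (resident[e]) continue;                       // already in VRAM
        if (spent + expert_bytes > budget_bytes) break;  // cap reached
        out.push_back(e);
        spent += expert_bytes;
    }
    return out;
}
```

With `margin` above 1, an expert that is used once does not immediately displace a resident with a comparable score, which is the anti-thrash behavior the description refers to; the exact rule in the PR may differ.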

Mechanism (CUDA, sm_86-tested):

expert_cache: VRAM slot pool, per-slot ready events, policy-aware eviction; ensure_resident streams H2D on demand, prefetch overlaps on a copy stream so the verify pass finds experts resident.
expert-stream-hooks: per-layer registry + opt-in hot-path hooks (no-op until a layer is registered, so the mul_mat_id splice cannot disturb the default path).
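
The "no-op until a layer is registered" property can be illustrated with a small registry sketch. `HookRegistrySketch`, `ExpertHook`, and `on_mul_mat_id` are hypothetical names for illustration only and do not reflect the PR's actual hook signatures; the point is that an unregistered layer takes an early exit and never touches the default path.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical callback type: layer index plus the routed expert ids.
using ExpertHook = std::function<void(int layer, const int* expert_ids, int n)>;

// Hypothetical per-layer registry with an opt-in hot-path hook.
class HookRegistrySketch {
    std::unordered_map<int, ExpertHook> hooks_;

public:
    void register_layer(int layer, ExpertHook hook) {
        assert(hook != nullptr);
        hooks_[layer] = std::move(hook);
    }

    void unregister_layer(int layer) { hooks_.erase(layer); }

    // Called from the hot path. Returns false without side effects when the
    // layer has no hook, so the default path is left untouched.
    bool on_mul_mat_id(int layer, const int* expert_ids, int n) const {
        if (hooks_.empty()) return false;        // nothing registered at all
        auto it = hooks_.find(layer);
        if (it == hooks_.end()) return false;    // this layer did not opt in
        it->second(layer, expert_ids, n);
        return true;
    }
};
```

A caller would register only the MoE layers it wants streamed; every other layer sees a single map lookup (or none, when the registry is empty).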

Fused MoE kernels (sm_86-tested vs CPU oracle):

moe-fused Q4_0 (abs 2.2e-6) and Q4_K (rel 6.4e-6), gather+dequant+route in one launch; Q4_K dequant mirrors ggml get_scale_min_k4 bit-for-bit.
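
As a reference for what "gather+dequant+route" computes, the following is a minimal CPU sketch of a Q4_0 oracle. It follows the ggml Q4_0 layout (32 values per block, low nibbles for the first 16 values, high nibbles for the last 16, each offset by 8), but stores the scale as `float` rather than fp16 to stay dependency-free. `BlockQ4_0Sketch`, `dot_q4_0`, and `moe_ref` are hypothetical names, and this is not the PR's test oracle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK = 32;  // values per Q4_0 block, as in ggml

// Simplified Q4_0 block: ggml stores d as fp16; float is an assumption here.
struct BlockQ4_0Sketch {
    float   d;           // per-block scale
    uint8_t qs[QK / 2];  // two 4-bit quants per byte
};

// Dequantize-and-dot one quantized row against activations x.
float dot_q4_0(const BlockQ4_0Sketch* row, int n_blocks, const float* x) {
    float acc = 0.0f;
    for (int b = 0; b < n_blocks; ++b) {
        for (int j = 0; j < QK / 2; ++j) {
            const int lo = (row[b].qs[j] & 0x0F) - 8;  // value j
            const int hi = (row[b].qs[j] >> 4) - 8;    // value j + 16
            acc += row[b].d * (lo * x[b * QK + j] + hi * x[b * QK + j + QK / 2]);
        }
    }
    return acc;
}

// y[r] = sum_k weights[k] * dot(W[ids[k]][r], x)
// experts[e] holds n_rows * n_blocks blocks in row-major order.
std::vector<float> moe_ref(const std::vector<std::vector<BlockQ4_0Sketch>>& experts,
                           int n_rows, int n_blocks,
                           const std::vector<int>& ids,
                           const std::vector<float>& weights,
                           const std::vector<float>& x) {
    assert(ids.size() == weights.size());
    assert(x.size() == (std::size_t) n_blocks * QK);
    std::vector<float> y(n_rows, 0.0f);
    for (std::size_t k = 0; k < ids.size(); ++k) {         // gather selected experts
        const BlockQ4_0Sketch* w = experts[ids[k]].data();
        for (int r = 0; r < n_rows; ++r)                    // dequant + route
            y[r] += weights[k] * dot_q4_0(w + (std::size_t) r * n_blocks, n_blocks, x.data());
    }
    return y;
}
```

A fused GPU kernel would perform the same three steps in one launch; comparing its output against a loop like this is one way to obtain error figures of the kind quoted above.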

Tests: test-expert-stream 15/15, test-expert-cache 21/21, test-expert-hooks 10/10, test-moe-fused-q4{,k} PASS. Splice points in EXPERT-STREAM-INTEGRATION.md. Call-site buffer wiring + nsys benchmark on 35B-A3B deferred.

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:


The fused-MoE Q4_0/Q4_K test kernels used uint8_t via MSVC's transitive <cstdint>; native Linux nvcc (CUDA 12, gcc) needs it included explicitly. Caught by running the suite on the real Lucebox (lucebox3, RTX 3090). Suite stays 5/5 green on both Windows (MSVC) and Linux (gcc) sm_86.

Measured on lucebox3 (RTX 3090, PCIe x4 @ ~6.5 GB/s) with a new microbenchmark modeling one Qwen3.6-35B-A3B MoE layer (bench-expert-stream.cu).

Root cause (found by instrumenting H2D copy counts, not by assuming): naive prefetch evicted experts that were still in the current step's working set, then re-streamed them. That issued ~75% more H2D copies than on-demand streaming (701 vs 402), so prefetch was *slower* even though copy/compute overlap worked fine on x4.

Fix: eviction now (a) uses decayed LFU usage instead of arbitrary id, and (b) never evicts an expert in the current step's working set (tracked via observe()). H2D copies drop to parity (443 == 443) and prefetch overlaps the draft pass as intended: +25-38% per layer-step (saves ~all the cold-expert streaming when the draft window covers it). Suite stays 5/5 green on the box.

Adds: residency_planner::usage(), expert_cache::h2d_copies telemetry, working-set protection, and the microbenchmark.
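
The eviction fix can be reproduced in miniature on the CPU. The sketch below is a hypothetical stand-in (`ExpertCacheSketch` and its members are illustrative names, and no real VRAM or CUDA events are involved): it counts simulated H2D copies and shows how a prefetch that evicts a working-set expert forces a re-stream, while working-set protection refuses the eviction instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cache model: slots are just a residency bitmap plus a counter.
struct ExpertCacheSketch {
    std::size_t capacity;
    bool protect_working_set;
    std::vector<float> usage;           // decayed LFU score per expert
    std::vector<bool>  resident;
    std::vector<bool>  in_working_set;  // experts routed to in the current step
    std::size_t n_resident = 0;
    std::size_t h2d_copies = 0;         // telemetry: simulated H2D transfers

    ExpertCacheSketch(std::size_t n_experts, std::size_t cap, bool protect)
        : capacity(cap), protect_working_set(protect),
          usage(n_experts, 0.0f), resident(n_experts, false),
          in_working_set(n_experts, false) {
        assert(cap > 0);
    }

    // Start a step: decay usage and record the step's working set.
    void observe(const std::vector<int>& ids, float decay = 0.5f) {
        for (float& u : usage) u *= decay;
        std::fill(in_working_set.begin(), in_working_set.end(), false);
        for (int e : ids) {
            usage[e] += 1.0f;
            in_working_set[e] = true;
        }
    }

    // Returns false when no slot can be freed without evicting a protected
    // expert; a real caller would need a fallback path in that case.
    bool ensure_resident(int e) {
        if (resident[e]) return true;
        if (n_resident == capacity) {
            int victim = -1;
            for (std::size_t i = 0; i < resident.size(); ++i) {
                if (!resident[i]) continue;
                if (protect_working_set && in_working_set[i]) continue;
                if (victim < 0 || usage[i] < usage[victim]) victim = (int) i;
            }
            if (victim < 0) return false;
            resident[victim] = false;
            --n_resident;
        }
        resident[e] = true;
        ++n_resident;
        ++h2d_copies;
        return true;
    }
};
```

With two slots, a step that routes to experts 0 and 1, and a prefetch of expert 2 issued mid-step, the unprotected variant evicts expert 0 and then has to stream it back, whereas the protected variant declines the prefetch and issues no extra copies. The numbers in this toy scenario are not the PR's measurements.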

…e draft

Adds FINDINGS.md from measurements on the actual Lucebox (RTX 3090 sm_86, PCIe x4 ~6.5 GB/s, native dflash_server, Qwen3.6-35B-A3B):

- Two regimes: expert streaming is the consumer-VRAM bottleneck ONLY when the model doesn't fit. Comfortable residency (91% resident) -> x4 idle (~2%), autoregressive-bound; tight budget -> x4 saturates (3-5 GB/s), 1.9x slower.
- TQ3_0 KV cache costs ~11% vs Q4_0 at this config (free win).
- Prefetch: mechanism proven on the real x4 link (working-set-protected eviction -> H2D copy parity 443==443, +25-38%/layer-step when the model doesn't fit); draft->expert RECALL stays untested: the available DFlash drafter delivers only ~7% (no Domino aux heads + CUDA-graph churn), so speculation never hits its 2-3x. Mechanism-proven, recall-pending.

Scopes the PR honestly: the fused Q4_0/Q4_K MoE kernel and the working-set eviction fix (transferable to Spark's cache) stand regardless of the draft.

github-actions bot added labels: documentation ggml CUDA testing (Jun 26, 2026)

dusterbloom added 3 commits June 26, 2026 14:10

dusterbloom wants to merge 4 commits into Luce-Org:luce-dflash from dusterbloom:feat/moe-expert-stream
