
perf(cuda): fuse Laguna MoE expert paths #23

Draft
davide221 wants to merge 4 commits into luce-dflash from codex-laguna-moe-kv-cuda

Conversation

@davide221

Draft companion PR for the Laguna hub performance branch.

Summary:

  • Adds a fused ggml Laguna MoE combine op.
  • Extends CUDA MMVQ/MUL_MAT_ID for batched/tokenwise MoE expert paths and fusion.
  • Keeps the temporary FA tracing/vector-kernel experiments out of this branch.
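For readers unfamiliar with the term, a rough CPU reference for what an MoE "combine" step usually computes is sketched below: each token's output is the router-weighted sum of its selected experts' outputs. This is an assumption about the general technique, not the op added in this PR; the function name `moe_combine_ref`, the tensor layouts, and the parameter names are hypothetical, and the fused CUDA op may differ in layout, precision, and what it fuses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference (unfused) MoE combine. Not taken from the PR.
// expert_out: [n_tokens][n_used][n_embd], row-major, one row per selected expert
// weights:    [n_tokens][n_used], router weight for each selected expert
// returns:    [n_tokens][n_embd], weighted sum over the selected experts
std::vector<float> moe_combine_ref(const std::vector<float>& expert_out,
                                   const std::vector<float>& weights,
                                   std::size_t n_tokens,
                                   std::size_t n_used,
                                   std::size_t n_embd) {
    assert(expert_out.size() == n_tokens * n_used * n_embd);
    assert(weights.size() == n_tokens * n_used);

    std::vector<float> out(n_tokens * n_embd, 0.0f);
    for (std::size_t t = 0; t < n_tokens; ++t) {
        for (std::size_t k = 0; k < n_used; ++k) {
            const float w = weights[t * n_used + k];
            const float* src = &expert_out[(t * n_used + k) * n_embd];
            float* dst = &out[t * n_embd];
            // Accumulate this expert's contribution into the token's output row.
            for (std::size_t i = 0; i < n_embd; ++i) {
                dst[i] += w * src[i];
            }
        }
    }
    return out;
}
```

A fused kernel would typically avoid materializing the per-expert rows and the separate scale/add passes. Such a kernel may also change the floating-point accumulation order relative to a reference like this one.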

Validation:

  • Built through the hub CUDA build: bench_laguna_generate and dflash_server.
  • Hub sanity run after this submodule commit: 128 prefill / 512 decode = 178.3 tok/s on RTX 3090 with f16 KV.

This is draft/WIP so we can preserve the branch while continuing byte-identical checks and the next decode-speed pass.
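As an illustration of what a byte-identical check can look like, the sketch below compares two output buffers by bit pattern rather than by float equality, so `0.0f` vs `-0.0f` and differing NaN payloads count as mismatches. The helper name `first_bit_mismatch` is hypothetical and not part of the hub or this branch; the actual checks may compare logits, sampled tokens, or whole files instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: returns the index of the first element whose bit
// pattern differs between a and b, or -1 if the buffers are byte-identical.
// A length mismatch is reported at the index where the shorter buffer ends.
std::ptrdiff_t first_bit_mismatch(const std::vector<float>& a,
                                  const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        // memcmp compares raw bytes, so -0.0f != 0.0f here, unlike operator==.
        if (std::memcmp(&a[i], &b[i], sizeof(float)) != 0) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    if (a.size() != b.size()) {
        return static_cast<std::ptrdiff_t>(n);
    }
    return -1;
}

// Example use in a test harness:
//   assert(first_bit_mismatch(reference_out, fused_out) == -1);
```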

davide221 and others added 3 commits June 28, 2026 00:27
Keep the CUDA-only fused MoE op explicit in the CPU dispatcher so non-CUDA builds do not fail on missing switch coverage.

Co-authored-by: Codex <codex@openai.com>
Bump the RPC protocol patch guard after the op-count change, include TQ3_0 in the unsupported CPU clamp cases, and avoid unreachable breaks after noreturn CPU-only guards.

Co-authored-by: Codex <codex@openai.com>

2 participants