Summary
KernelBench currently has no MoE / expert-routing operators, yet sparse-MoE is core to modern LLM inference (gpt-oss, Mixtral, Qwen-MoE, DeepSeek). I'd like to contribute (a) a fused top-k MoE-SwiGLU problem + reference, (b) gfx1151 (AMD Strix Halo) results, and (c) an observation that makes MoE an especially meaningful KernelBench target.
Why MoE is a great KernelBench target: torch.compile structurally can't optimize it
The MoE forward has data-dependent control flow (a per-expert token gather guarded by if tok.any():), which forces graph breaks in torch.compile / inductor. Measured on gfx1151 (FP32, run_and_check, CUDA-event timing under no_grad):
| | time |
|---|---|
| eager | 5.24 ms |
| torch.compile | 5.25 ms ← no speedup (graph-break) |
| fused Triton kernel | 4.75 ms |
→ 1.10x over BOTH eager and torch.compile, correctness max_abs_err 7e-7 (5/5), no precision downgrade.
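For readers who want to reproduce this kind of measurement, here is a minimal sketch of CUDA-event timing under no_grad. It is not the harness that produced the numbers above; the helper name time_ms, the warmup and iteration counts, and the CPU fallback are all assumptions made for illustration.

```python
import time
import torch


def time_ms(fn, *args, warmup: int = 3, iters: int = 10) -> float:
    """Mean time per call in milliseconds, measured under no_grad (hypothetical helper)."""
    use_cuda = torch.cuda.is_available()  # ROCm builds of PyTorch also report True here
    with torch.no_grad():
        for _ in range(warmup):           # warm up caches / lazy initialization
            fn(*args)
        if use_cuda:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            torch.cuda.synchronize()
            start.record()
            for _ in range(iters):
                fn(*args)
            end.record()
            torch.cuda.synchronize()      # wait until the end event has completed
            return start.elapsed_time(end) / iters
        # Assumption: fall back to wall-clock timing when no GPU is present.
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        return (time.perf_counter() - t0) * 1e3 / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
ms = time_ms(torch.matmul, a, a)
print(f"matmul: {ms:.3f} ms per call")
```

The same helper can be pointed at an eager module, its torch.compile wrapper, and a custom kernel to get comparable per-call times.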
Unlike simple elementwise/softmax/scan fusions — which inductor already fuses, so a Triton kernel only ties torch.compile at ~1.0x — MoE is a case where a hand-written kernel has a durable structural advantage. That makes it a much more discriminative benchmark for LLM-kernel-generation agents than fusions inductor already handles.
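To make the control-flow point concrete, here is a minimal sketch of an eager top-k MoE-SwiGLU forward. It is not the reference proposed for the PR: it assumes the usual KernelBench problem layout (a Model class plus get_inputs / get_init_inputs), the sizes are hypothetical, and applying softmax over the selected top-k logits is just one common routing convention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    """Eager top-k MoE with SwiGLU experts (illustrative sketch only)."""

    def __init__(self, hidden: int, ffn: int, num_experts: int, top_k: int):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        # Per-expert SwiGLU weights: gate and up projections, then a down projection.
        self.w_gate = nn.Parameter(torch.randn(num_experts, hidden, ffn) * 0.02)
        self.w_up = nn.Parameter(torch.randn(num_experts, hidden, ffn) * 0.02)
        self.w_down = nn.Parameter(torch.randn(num_experts, ffn, hidden) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden)
        logits = self.router(x)
        weights, experts = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the selected experts
        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            tok = experts == e                # (tokens, top_k) boolean mask
            if tok.any():                     # branch on tensor values: data-dependent
                rows, slots = tok.nonzero(as_tuple=True)  # output shape depends on data
                h = x[rows]                   # gather the tokens routed to expert e
                h = F.silu(h @ self.w_gate[e]) * (h @ self.w_up[e])
                h = h @ self.w_down[e]
                out.index_add_(0, rows, h * weights[rows, slots].unsqueeze(-1))
        return out


# Hypothetical problem sizes, chosen small so the sketch runs quickly on CPU.
def get_inputs():
    return [torch.randn(64, 32)]


def get_init_inputs():
    return [32, 64, 8, 2]  # hidden, ffn, num_experts, top_k
```

Both the `if tok.any():` test and the `nonzero` call depend on tensor values rather than shapes, which is the pattern the argument above is about; a fused kernel can instead handle routing inside a single launch.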
What I'd contribute
Would a MoE-style operator set be of interest for KernelBench? Happy to open a PR with the problem + reference + gfx1151 results.