Add MoE (expert-routing) operators — torch.compile structurally can't optimize them (gfx1151 data)

## Summary

KernelBench currently has no **MoE / expert-routing** operators, yet sparse-MoE is core to modern LLM inference (gpt-oss, Mixtral, Qwen-MoE, DeepSeek). I'd like to contribute (a) a fused top-k MoE-SwiGLU problem + reference, (b) gfx1151 (AMD Strix Halo) results, and (c) an observation that makes MoE an especially meaningful KernelBench target.

## Why MoE is a great KernelBench target: torch.compile structurally can't optimize it

The MoE forward pass has **data-dependent control flow** (a per-expert token gather guarded by `if tok.any():`), which causes **graph breaks in torch.compile / inductor**. Measured on gfx1151 (FP32, `run_and_check`, CUDA-event timing under `no_grad`):

| implementation | time |
|---|---|
| eager | 5.24 ms |
| **torch.compile** | **5.25 ms** ← no speedup (graph-break) |
| fused Triton kernel | **4.75 ms** |

→ The fused kernel is **1.10x faster than BOTH eager and torch.compile**, with correctness max_abs_err **7e-7** (5/5) and no precision downgrade.

Unlike simple elementwise/softmax/scan fusions — which inductor already fuses, so a Triton kernel only *ties* torch.compile at ~1.0x — MoE is a case where a hand-written kernel has a **durable structural advantage**. That makes it a much more discriminative benchmark for LLM-kernel-generation agents than fusions inductor already handles.
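
To make the data-dependence concrete, here is a small sketch (not taken from the benchmark code; the helper name `tokens_per_expert` and the toy sizes are made up for illustration). It shows that the number of tokens each expert receives, and therefore the shape of every per-expert gather, depends on the router output rather than on the input shapes alone:

```python
import torch


def tokens_per_expert(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Count how many (token, slot) assignments each expert receives."""
    num_experts = router_logits.shape[-1]
    chosen = router_logits.topk(top_k, dim=-1).indices  # (tokens, top_k)
    return torch.bincount(chosen.flatten(), minlength=num_experts)


torch.manual_seed(0)

# Same shapes (8 tokens, 4 experts), different routing decisions.
logits_a = torch.randn(8, 4)
logits_b = torch.zeros(8, 4)
logits_b[:, 0] = 1.0  # every token prefers expert 0 ...
logits_b[:, 1] = 0.5  # ... and expert 1 second

counts_a = tokens_per_expert(logits_a, top_k=2)
counts_b = tokens_per_expert(logits_b, top_k=2)

print("random routing :", counts_a.tolist())
print("skewed routing :", counts_b.tolist())  # [8, 8, 0, 0]
```

In the skewed case experts 2 and 3 receive no tokens, which is exactly the situation a guard like `if tok.any():` handles, and the gathered batch for each expert has a size that is only known after the router has run.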

## What I'd contribute

- A problem: **fused top-k MoE SwiGLU** (router softmax + top-k + per-expert grouped GEMM + SwiGLU + weighted scatter-add) with a clean PyTorch reference.
- gfx1151 baseline timing + a compliant FP32 Triton solution (fuses gate/up GEMM, SwiGLU activation, and weighted scatter-add; GEMM stays on rocBLAS).
- Context: this comes from running an LLM agent (GLM-5.1) generating Triton kernels on AMD gfx1151; the broader results incl. an FP16 precision-downgrade reward-hack case are in #155.
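
As a rough idea of what the reference could look like, here is a minimal sketch in the usual `Model` / `get_inputs` / `get_init_inputs` layout. Everything specific in it is an assumption for illustration, not the file I would submit: the sizes, the stacked weight layout, the parameter names, and the choice not to renormalize the top-k weights after the softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    """Top-k MoE with SwiGLU experts (illustrative sketch, hypothetical sizes)."""

    def __init__(self, hidden: int, inter: int, num_experts: int, top_k: int):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        # Stacked per-expert weights: gate/up map hidden -> inter, down maps back.
        self.w_gate = nn.Parameter(torch.randn(num_experts, hidden, inter) * 0.02)
        self.w_up = nn.Parameter(torch.randn(num_experts, hidden, inter) * 0.02)
        self.w_down = nn.Parameter(torch.randn(num_experts, inter, hidden) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)
        weights, experts = probs.topk(self.top_k, dim=-1)  # both (tokens, top_k)
        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            mask = experts == e
            if not mask.any():  # data-dependent branch: expert got no tokens
                continue
            tok, slot = torch.where(mask)  # gather size depends on routing
            h = x[tok]
            act = F.silu(h @ self.w_gate[e]) * (h @ self.w_up[e])  # SwiGLU
            y = (act @ self.w_down[e]) * weights[tok, slot].unsqueeze(-1)
            out.index_add_(0, tok, y)  # weighted scatter-add back to token order
        return out


# Hypothetical problem sizes, chosen only to keep the sketch small.
TOKENS, HIDDEN, INTER, NUM_EXPERTS, TOP_K = 64, 128, 256, 8, 2


def get_inputs():
    return [torch.randn(TOKENS, HIDDEN)]


def get_init_inputs():
    return [HIDDEN, INTER, NUM_EXPERTS, TOP_K]
```

The per-expert loop is deliberately left in its plain eager form, since that is the part a fused kernel would replace.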

Would a MoE-style operator set be of interest for KernelBench? Happy to open a PR with the problem + reference + gfx1151 results.

