Add MoE (expert-routing) operators — torch.compile structurally can't optimize them (gfx1151 data) #156

@fxp

Summary

KernelBench currently has no MoE / expert-routing operators, yet sparse-MoE is core to modern LLM inference (gpt-oss, Mixtral, Qwen-MoE, DeepSeek). I'd like to contribute (a) a fused top-k MoE-SwiGLU problem + reference, (b) gfx1151 (AMD Strix Halo) results, and (c) an observation that makes MoE an especially meaningful KernelBench target.

Why MoE is a great KernelBench target: torch.compile structurally can't optimize it

The MoE forward pass has data-dependent control flow (a per-expert token gather guarded by if tok.any():), which causes graph breaks under torch.compile / inductor. Measured on gfx1151 (FP32, via run_and_check, under no_grad with CUDA-event timing):

| Variant | Time | Note |
| --- | --- | --- |
| eager | 5.24 ms | |
| torch.compile | 5.25 ms | no speedup (graph break) |
| fused Triton kernel | 4.75 ms | |

The fused kernel is 1.10x faster than both eager and torch.compile, with correctness max_abs_err 7e-7 (5/5) and no precision downgrade.
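For anyone who wants to reproduce numbers of this kind, here is a minimal sketch of a no_grad, CUDA-event timing helper. It is not KernelBench's run_and_check; the name time_ms and the warmup/iters defaults are my own assumptions, and it falls back to a wall clock when no GPU is present.

```python
import time
import torch


def time_ms(fn, *args, warmup=3, iters=10):
    """Mean time of fn(*args) in milliseconds, measured under no_grad.

    Hypothetical helper: warmup/iters defaults are illustrative only.
    """
    use_gpu = torch.cuda.is_available()  # also True on ROCm builds of PyTorch
    with torch.no_grad():
        for _ in range(warmup):
            fn(*args)
        if use_gpu:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            torch.cuda.synchronize()
            start.record()
            for _ in range(iters):
                fn(*args)
            end.record()
            torch.cuda.synchronize()  # wait for all queued kernels to finish
            return start.elapsed_time(end) / iters
        # CPU fallback so the sketch still runs without a GPU
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        return (time.perf_counter() - t0) * 1e3 / iters
```

Timing the eager module, the torch.compile-wrapped module, and the Triton kernel with the same helper keeps the three rows comparable.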

Unlike simple elementwise/softmax/scan fusions, which inductor already fuses (so a Triton kernel only ties torch.compile at ~1.0x), MoE is a case where a hand-written kernel has a durable structural advantage. That makes it a much more discriminative benchmark for LLM-kernel-generation agents than fusions inductor already handles.
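To make the control-flow issue concrete, here is a rough sketch of an eager top-k MoE-SwiGLU forward. It is not the exact reference I would submit: the function name, the tensor layouts, and the choice to softmax over only the selected top-k logits are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def moe_swiglu_reference(x, router_w, w_gate, w_up, w_down, top_k=2):
    """Eager top-k MoE-SwiGLU forward (illustrative; layouts are assumptions).

    x:        (tokens, d_model)
    router_w: (d_model, n_experts)
    w_gate:   (n_experts, d_model, d_ff)
    w_up:     (n_experts, d_model, d_ff)
    w_down:   (n_experts, d_ff, d_model)
    """
    n_experts = router_w.shape[1]
    logits = x @ router_w                            # (tokens, n_experts)
    top_vals, top_idx = logits.topk(top_k, dim=-1)   # (tokens, top_k)
    weights = F.softmax(top_vals, dim=-1)            # assumption: normalize over chosen experts

    out = torch.zeros_like(x)
    for e in range(n_experts):
        tok = top_idx == e                           # (tokens, top_k) bool mask
        if tok.any():                                # data-dependent branch
            # nonzero() returns a tensor whose size depends on the data
            rows, slots = tok.nonzero(as_tuple=True)
            xe = x[rows]                             # gather this expert's tokens
            h = F.silu(xe @ w_gate[e]) * (xe @ w_up[e])
            ye = h @ w_down[e]
            out.index_add_(0, rows, ye * weights[rows, slots].unsqueeze(-1))
    return out
```

The branch on tok.any() and the data-dependent size of the gathered rows are the parts a traced graph cannot capture statically. A fused kernel can handle the routing inside the kernel instead.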

What I'd contribute

Would a MoE-style operator set be of interest for KernelBench? Happy to open a PR with the problem + reference + gfx1151 results.
