Commit b7c2e2a

evilsocket and claude committed
perf: skip Metal sync after QKV matmul during generation
The synchronize() call after the QKV projection was running on every token, including generation (seq_len=1). Since generation now uses the fused SDPA kernel (few commands), the sync is unnecessary and adds ~4ms per full attention layer × 6 layers = ~24ms of overhead per token.

Benchmark (M3 Pro, Qwen3.5-0.8B, 50 tokens):
- Before: 15.2 tok/s
- After: 16.1 tok/s (+5.9%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 64b66e5 commit b7c2e2a
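A quick sanity check of the reported speedup percentage from the two throughput figures (the `speedup_percent` helper is illustrative, not part of the codebase):

```rust
// Verify that 15.2 -> 16.1 tok/s corresponds to the +5.9% in the commit message.
fn speedup_percent(before_tok_s: f64, after_tok_s: f64) -> f64 {
    (after_tok_s / before_tok_s - 1.0) * 100.0
}

fn main() {
    // Benchmark figures from the commit message (M3 Pro, Qwen3.5-0.8B, 50 tokens).
    let before = 15.2;
    let after = 16.1;
    println!("speedup: +{:.1}%", speedup_percent(before, after)); // +5.9%
}
```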

1 file changed: cake-core/src/models/qwen3_5/full_attention.rs (5 additions, 3 deletions)
@@ -138,9 +138,11 @@ impl Qwen3_5FullAttention {
         let qkv = self.backend.linear_forward(x, &self.qkv_proj_weight, None)
             .map_err(|e| anyhow!("qkv_proj: {e}"))?;

-        // Flush GPU commands after QKV matmul (always needed — full attention
-        // accumulates ~24 commands between syncs, can't afford more)
-        let _ = self.backend.synchronize();
+        // Flush GPU commands after QKV matmul — needed for prefill where many
+        // operations follow. Generation (seq_len=1) uses fused SDPA with few commands.
+        if seq_len > 1 {
+            let _ = self.backend.synchronize();
+        }

         // Split: Q (doubled for gating), K, V
         let q_out = qkv.narrow(D::Minus1, 0, self.q_size)
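The change above can be sketched in isolation with a mock backend that counts sync calls; `MockBackend` and `attention_step` are illustrative stand-ins, not the cake-core API:

```rust
use std::cell::Cell;

// Stand-in for the Metal backend: records how many times synchronize() runs.
struct MockBackend {
    syncs: Cell<usize>,
}

impl MockBackend {
    fn synchronize(&self) -> Result<(), String> {
        self.syncs.set(self.syncs.get() + 1);
        Ok(())
    }
}

fn attention_step(backend: &MockBackend, seq_len: usize) {
    // ... QKV matmul would run here ...
    // Flush GPU commands only during prefill; generation (seq_len == 1)
    // goes through the fused SDPA path with few commands and skips the sync.
    if seq_len > 1 {
        let _ = backend.synchronize();
    }
}

fn main() {
    let backend = MockBackend { syncs: Cell::new(0) };
    attention_step(&backend, 512); // prefill: sync happens
    attention_step(&backend, 1);   // generation: sync skipped
    println!("syncs = {}", backend.syncs.get()); // syncs = 1
}
```

Per the commit message, the saved sync is ~4ms per full-attention layer, so over 6 layers the skip removes roughly 24ms of per-token overhead during generation.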
