TurboQuant KV cache (1/4): graph rewrite + schema (foundation) #28560
Draft
TimPietrusky wants to merge 1 commit into
Adds a `TurboQuantKVFusion` graph transformer that rewrites every
GroupQueryAttention node at session-create time to use a TurboQuant
4-bit packed KV cache, plus the schema, session-option keys, and CPU
helpers required for that rewrite. No kernels in this PR — they
land in follow-ups for CUDA and WebGPU.
What this PR includes:
* `core/optimizer/turboquant_kv_fusion.{cc,h}` — the L2 transformer.
Enabled by setting `optimization.turboquant_kv_method` to one of
`turboquant_4bit_nc`, `turboquant_k3v4_nc`, `turboquant_3bit_nc`.
Runs on CUDA + WebGPU EPs. Computes Lloyd-Max centroids for the
given (head_dim, key_bits) and a normalised Walsh–Hadamard matrix,
injects both as graph initializers, and mutates each GQA node's
attributes + past/present tensor types to (uint8, slot_bytes).
* `core/graph/contrib_ops/bert_defs.cc` — extends GroupQueryAttention
with the new attributes (`kv_quant_method`, `key_quant_bits`,
`value_quant_bits`, `norm_correction`) and two new optional inputs
at slots 14 / 15 for the shared k_codebook + hadamard initializers.
* `include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h`
— public option keys `optimization.turboquant_kv_method` and
`optimization.turboquant_kv_boundary`.
* `contrib_ops/cpu/bert/attention_common.h` + `attention_parameters.h`
+ `group_query_attention_helper.h` — `KVQuantMethod` enum, parameter
struct extensions, and `CheckInputs` updates so the fp16 codepath
passes through unchanged when TurboQuant isn't requested.
* `include/onnxruntime/core/framework/int3.h` — new packed `UInt3x8`
type for 3-bit cache slots. Used by the (forthcoming) 3-bit
variants.
* `test/contrib_ops/turboquant_kv_test.cc` — host-side bit-layout
tests for `UInt3x8`. Kernel-level correctness is validated by the
follow-up CUDA / WebGPU PRs.
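To make the idea of a packed 3-bit slot concrete, here is a minimal sketch of packing eight 3-bit codes into three bytes and unpacking them again. The function names and the LSB-first bit order are assumptions for illustration only; the actual `UInt3x8` layout in `int3.h` may differ.

```python
def pack_uint3x8(values):
    """Pack eight 3-bit codes (0..7) into 3 bytes.

    Assumed layout (not taken from the PR): code i occupies
    bits [3*i, 3*i + 2] of a 24-bit little-endian word.
    """
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= 7:
            raise ValueError(f"value {v} does not fit in 3 bits")
        word |= v << (3 * i)
    return word.to_bytes(3, "little")


def unpack_uint3x8(packed):
    """Inverse of pack_uint3x8: 3 bytes -> list of eight 3-bit codes."""
    if len(packed) != 3:
        raise ValueError("expected exactly 3 bytes")
    word = int.from_bytes(packed, "little")
    return [(word >> (3 * i)) & 0b111 for i in range(8)]


if __name__ == "__main__":
    codes = [0, 1, 2, 3, 4, 5, 6, 7]
    packed = pack_uint3x8(codes)
    print(packed.hex(), unpack_uint3x8(packed))
```

A host-side bit-layout test in this style checks that every code survives the round trip, including the codes at indices 2 and 5, which straddle a byte boundary under this layout.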
When `optimization.turboquant_kv_method` is unset or set to "none" /
"off" the transformer doesn't fire and the graph is byte-identical
to today's output.
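The gating rule above can be sketched as a small lookup. This is illustrative Python, not the C++ in `turboquant_kv_fusion.cc`: `resolve_preset` and `KVQuantPreset` are hypothetical names, the key/value bit widths are inferred from the preset names, and raising on an unknown preset is an assumption about behaviour the PR text does not specify.

```python
from typing import NamedTuple, Optional


class KVQuantPreset(NamedTuple):
    key_bits: int
    value_bits: int


# Bit widths inferred from the preset names (assumption, not from the PR).
_PRESETS = {
    "turboquant_4bit_nc": KVQuantPreset(key_bits=4, value_bits=4),
    "turboquant_k3v4_nc": KVQuantPreset(key_bits=3, value_bits=4),
    "turboquant_3bit_nc": KVQuantPreset(key_bits=3, value_bits=3),
}

# Values that leave the graph untouched: unset, "none", or "off".
_DISABLED = {"", "none", "off"}


def resolve_preset(config: dict) -> Optional[KVQuantPreset]:
    """Return the preset to apply, or None when the transformer should not fire."""
    method = config.get("optimization.turboquant_kv_method", "")
    if method in _DISABLED:
        return None
    if method not in _PRESETS:
        # Assumed behaviour: reject unknown names rather than silently ignore them.
        raise ValueError(f"unknown TurboQuant preset: {method!r}")
    return _PRESETS[method]


if __name__ == "__main__":
    print(resolve_preset({}))
    print(resolve_preset({"optimization.turboquant_kv_method": "turboquant_k3v4_nc"}))
```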
Design doc + reference NumPy implementation + paper-validation tests
are coming in the Python tooling PR. The CUDA kernels (16-bit accum
WMMA + 4-bit packed cache) and the WebGPU kernels (WGSL encode/decode
with an ApplyAttention fallback for browsers without Subgroups) come
in separate PRs that each depend on this one.
Benches (LFM2.5-1.2B, RTX A40, all measured):
ctx      fp16 decode    TQ decode      speedup
4 K      6.2 s reply    6.0 s reply    tied
32 K     26 s           24 s           7 %
64 K     63 s           41 s           53 %
128 K    (fp16 OOM)     65 s           TQ only
This was referenced May 19, 2026
Description
Adds the foundation pieces for a TurboQuant 4-bit KV-cache compression path inside ORT: a session-create-time graph rewriter, schema extensions on `GroupQueryAttention`, and the public session-option keys that opt users in. No kernels in this PR; they ship in two follow-ups for CUDA and WebGPU, each able to be reviewed in isolation.
The goal: when a user loads a stock q4f16 ONNX export from HuggingFace and sets one session option, ORT rewrites every `GroupQueryAttention` node in memory to use a 4-bit packed KV cache (Lloyd-Max codebook for K, asymmetric uniform for V, Walsh-Hadamard pre-rotation). No offline conversion, no second `.onnx` on disk, no transformers.js / HF Hub helpers to teach about a new dtype.
Why TurboQuant
KV cache eats memory linearly with context length. At 128K context, a 1.2B-param model's fp16 KV is ~1.5 GB and dominates VRAM. TurboQuant compresses it ~3.6× to ~430 MB without retraining and with cosine similarity 0.99+ vs fp16 on every layer we benched (LFM2.5-1.2B, Qwen3.5-0.8B-Text). Paper: https://arxiv.org/abs/2412.10319 — implementation is bit-exact against vLLM's reference where they overlap.
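As a rough illustration of the Walsh-Hadamard pre-rotation followed by 4-bit quantisation, the sketch below rotates a vector with a normalised Hadamard matrix, applies an asymmetric uniform quantiser (the scheme named for V), and measures cosine similarity after the round trip. It is a pure-Python toy under stated assumptions: the Sylvester ordering of the matrix, per-vector min/max scaling, and all function names are illustrative choices, and the Lloyd-Max codebook used for K is not shown. It is not the PR's kernel code.

```python
import math
import random


def hadamard(n):
    """Normalised n x n Walsh-Hadamard matrix (Sylvester construction)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def quantize_uniform(v, bits=4):
    """Asymmetric uniform quantisation: integer codes plus (scale, offset)."""
    lo, hi = min(v), max(v)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant vectors
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in v]
    return codes, scale, lo


def dequantize_uniform(codes, scale, lo):
    return [c * scale + lo for c in codes]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def roundtrip(v, bits=4):
    """Rotate, quantise, dequantise, rotate back."""
    h = hadamard(len(v))
    codes, scale, lo = quantize_uniform(matvec(h, v), bits)
    restored = dequantize_uniform(codes, scale, lo)
    # The normalised Sylvester matrix is symmetric and orthonormal,
    # so applying it a second time undoes the rotation.
    return matvec(h, restored)


if __name__ == "__main__":
    rng = random.Random(0)
    v = [rng.gauss(0.0, 1.0) for _ in range(64)]
    print(f"cosine after 4-bit round trip: {cosine(v, roundtrip(v)):.4f}")
```

The rotation spreads a single large outlier evenly across all coordinates, which keeps the min/max range of the quantiser small relative to the typical coordinate. That is the usual motivation for rotating before low-bit quantisation.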
What this PR contains
* `core/optimizer/turboquant_kv_fusion.{cc,h}`: the L2 graph transformer. Fires when `optimization.turboquant_kv_method` is set to a non-empty preset. Scope: `{kCudaExecutionProvider, kWebGpuExecutionProvider}`. Computes Lloyd-Max centroids for the given (head_dim, key_bits) and a normalised Walsh-Hadamard matrix, injects both as shared graph initializers, then mutates each `GroupQueryAttention` node's attributes + past/present tensor types to `(uint8, slot_bytes)`.
* `core/graph/contrib_ops/bert_defs.cc`: extends `GroupQueryAttention` with four new attributes (`kv_quant_method`, `key_quant_bits`, `value_quant_bits`, `norm_correction`) and two new optional inputs at slots 14/15 for the shared `k_codebook` + `hadamard` initializers. All defaulted so the standard fp16 path is unchanged when the option is unset.
* `include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h`: two new public option keys: `optimization.turboquant_kv_method` (preset name) and `optimization.turboquant_kv_boundary` (number of first/last layers to leave in fp16 for accuracy).
* `contrib_ops/cpu/bert/{attention_common.h,attention_parameters.h,group_query_attention_helper.h}`: the `KVQuantMethod` enum, parameter struct extensions, and `CheckInputs` updates that let the fp16 codepath remain byte-identical while the TQ codepath has the data it needs.
* `include/onnxruntime/core/framework/int3.h`: a new packed `UInt3x8` type used by the 3-bit cache variants (`turboquant_k3v4_nc`, `turboquant_3bit_nc`).
* `test/contrib_ops/turboquant_kv_test.cc`: host-side bit-layout tests for `UInt3x8`. Kernel-level correctness is validated by the CUDA / WebGPU PRs that land on top of this one.
What's NOT in this PR (intentionally)
* `last_token_logits` model patcher: separate PR.
When `optimization.turboquant_kv_method` is unset (the default), nothing in this PR runs and graph optimisation is byte-identical to today.
Verified locally
* `onnxruntime_provider_test --gtest_filter=TurboQuantKVTest.*`
Motivation and Context
Brings the long-context inference gains that TurboQuant already delivers inside vLLM to the ORT ecosystem (CUDA, Apple Silicon Metal, WebGPU EP via Dawn), including the browser via onnxruntime-web. Drop-in via one session option; no model conversion.