selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch #2473
hanbitmyths wants to merge 4 commits
…TI_GPU dispatch
- Normalize per-layer quant config overrides so Q/K/V projections in the same attention block share precision, required by ModelBuilder for GQA fusion.
- Add AUTO setting for kld_memory_mode that picks among FULL, MULTI_GPU, LOW_MEMORY, OFFLOAD based on available GPU memory and model size.
- Add MULTI_GPU mode that uses Accelerate's dispatch_model with _no_split_modules honored, plus a coalescing pass that pins every model.layers.N.* entry to a single device and falls back to LOW_MEMORY if a decoder layer still spans devices.
- Tests: 24 unit tests covering QKV grouping, AUTO selection thresholds, and the MULTI_GPU device-map coalescing path.
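The coalescing pass and its fallback described in this commit message can be illustrated with a minimal sketch. It operates on a plain device-map dict of the shape Accelerate's `infer_auto_device_map` returns (module name to device). The helper names `coalesce_layer_devices`, `layers_spanning_devices`, and `resolve_mode` are hypothetical and are not taken from the PR; the actual implementation may differ.

```python
import re

# Matches "model.layers.N" and anything below it, without confusing layer 1 with layer 10.
LAYER_RE = re.compile(r"^(model\.layers\.\d+)(?:\.|$)")


def coalesce_layer_devices(device_map):
    """Pin every model.layers.N.* entry to the first device seen for layer N."""
    first_device = {}
    coalesced = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match is None:
            # Not part of a decoder layer (embeddings, final norm, lm_head, ...).
            coalesced[name] = device
            continue
        layer = match.group(1)
        first_device.setdefault(layer, device)
        coalesced[name] = first_device[layer]
    return coalesced


def layers_spanning_devices(device_map):
    """Return the decoder layers whose entries sit on more than one device."""
    devices = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match is not None:
            devices.setdefault(match.group(1), set()).add(device)
    return sorted(layer for layer, devs in devices.items() if len(devs) > 1)


def resolve_mode(device_map):
    """Coalesce, then fall back to low_memory if any layer is still split (defensive)."""
    coalesced = coalesce_layer_devices(device_map)
    if layers_spanning_devices(coalesced):
        return "low_memory", None
    return "multi_gpu", coalesced


# Hypothetical map in which layer 1 was split across GPU 0 and GPU 1.
raw_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1.self_attn": 0,
    "model.layers.1.mlp": 1,
    "model.layers.2": 1,
    "lm_head": 1,
}
mode, final_map = resolve_mode(raw_map)
```

In this sketch the coalescing step already guarantees one device per layer, so the check in `resolve_mode` is purely defensive, matching the fallback the commit message describes.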
Pull request overview
This PR strengthens the SelectiveMixedPrecision (SMP) PyTorch pass for LLMs targeting ONNX Runtime GenAI by (a) enforcing Q/K/V consistency in both scored selection and quantization overrides, and (b) adding `auto` and `multi_gpu` memory modes for KLD-gradient scoring, which makes scoring practical on large models.
Changes:
- Add Q/K/V-aware grouping so scored selection promotes attention input projections together, and normalize quantization overrides so Q/K/V share the most-precise config.
- Introduce `kld_memory_mode` with `auto` resolution plus a new `multi_gpu` mode using Accelerate dispatch and device-map coalescing/validation.
- Expand unit tests to cover QKV grouping/normalization, KLD scoring equivalence across memory modes, and AUTO/MULTI_GPU selection behavior.
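The override normalization in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the code in `quant_utils.py`: it assumes separate `q_proj`/`k_proj`/`v_proj` module names under a common parent, a config dict with a `bits` key, and that "most precise" means the highest bit width. The function name `normalize_qkv_overrides` is hypothetical.

```python
QKV_SUFFIXES = ("q_proj", "k_proj", "v_proj")


def normalize_qkv_overrides(overrides):
    """Give Q, K and V of each attention block the most-precise config found among them.

    Assumption: each config is a dict with a "bits" entry, and more bits means
    more precise. Partners that are missing from the overrides are added.
    """
    groups = {}
    for name in overrides:
        parent, _, leaf = name.rpartition(".")
        if leaf in QKV_SUFFIXES:
            groups.setdefault(parent, []).append(name)

    normalized = dict(overrides)
    for parent, members in groups.items():
        best = max((overrides[m] for m in members), key=lambda cfg: cfg["bits"])
        for suffix in QKV_SUFFIXES:
            key = f"{parent}.{suffix}" if parent else suffix
            normalized[key] = dict(best)  # copy so the entries stay independent
    return normalized


# Hypothetical partial override: only q_proj was promoted to 8 bits.
overrides = {
    "model.layers.0.self_attn.q_proj": {"bits": 8},
    "model.layers.0.self_attn.k_proj": {"bits": 4},
    "model.layers.0.mlp.down_proj": {"bits": 8},
}
normalized = normalize_qkv_overrides(overrides)
```

After normalization, all three attention input projections of layer 0 carry the 8-bit config, while the unrelated `mlp.down_proj` entry is left unchanged.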
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| olive/passes/pytorch/selective_mixed_precision.py | Adds QKV grouping in scored overrides and implements AUTO/FULL/MULTI_GPU/LOW_MEMORY/OFFLOAD KLD scoring paths with heuristics and Accelerate-based sharding. |
| olive/passes/pytorch/quant_utils.py | Adds QKV group discovery + override normalization to ensure attention input projections share a consistent quant config, including support for excluded attention inputs. |
| test/passes/pytorch/test_selective_mixed_precision.py | Adds extensive unit tests for QKV grouping/normalization and KLD scoring/memory-mode behavior, including MULTI_GPU dispatch stubbing. |
    scored_items = [
        (
            group,
            sum(module_numels[module_name] for module_name in group),
Thanks for opening this PR! I had been thinking about the need to keep the same settings for Q/K/V but didn't get around to it.
I don't think summing the scores is a good way to aggregate them, since this would just make the summed QKV scores higher than the scores for single modules.
I have a commit on top of your branch at 8f722f3 and created a draft PR, #2475, with an alternative plus some refactoring of the codebase to make it more modular. I also updated prepare_model's QKV config normalization to account for already quantized modules.
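To make the objection to summing concrete, here is a small sketch comparing aggregation choices on hypothetical scores. The module names, score values, and the `aggregate`/`rank` helpers are invented for illustration; the mean and max variants are just two size-independent options and are not necessarily what #2475 implements.

```python
def aggregate(scores, group, how):
    """Combine the per-module scores of a group into a single ranking score."""
    values = [scores[name] for name in group]
    if how == "sum":
        return sum(values)
    if how == "mean":
        return sum(values) / len(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown aggregation: {how}")


def rank(scores, groups, how):
    """Order groups from highest to lowest aggregated score."""
    return sorted(groups, key=lambda g: aggregate(scores, g, how), reverse=True)


# Hypothetical sensitivity scores: each QKV module is less sensitive than down_proj.
scores = {
    "attn.q_proj": 0.25,
    "attn.k_proj": 0.25,
    "attn.v_proj": 0.25,
    "mlp.down_proj": 0.5,
}
qkv_group = ("attn.q_proj", "attn.k_proj", "attn.v_proj")
single_group = ("mlp.down_proj",)
groups = [qkv_group, single_group]

by_sum = rank(scores, groups, "sum")    # QKV wins only because it has three members
by_mean = rank(scores, groups, "mean")  # down_proj wins on per-module sensitivity
```

With summation the three-member group outranks the single module even though every one of its members scores lower, which is the bias the comment above points out.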
Since your PR was created from a fork, we are unable to run the CI on it and would need to create a copy of your branch to make it work. Could you please create the branch and PR directly from the original repository? You should already have contributor access.
For this PR, we could also merge the copy PR I made, if you are happy with the changes I made on top. Thanks!
Superseded by #2477.

Continuing review and updates on the origin-branch PR to avoid fork-head workflow.

Closing as superseded by #2477.
…TI_GPU dispatch (#2475)

## Describe your changes

Based on #2473

## Checklist before requesting a review

- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

## (Optional) Issue link

---------

Co-authored-by: Sunghoon Choi <sunghcho@microsoft.com>
Co-authored-by: Sunghoon Choi <35605090+hanbitmyths@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This PR hardens `SelectiveMixedPrecision` (SMP) for real-world LLMs targeting ONNX Runtime GenAI:

- QKV-aware quant config overrides (`olive/passes/pytorch/quant_utils.py`): Normalize the per-layer override dict so that the Q, K, and V projections in the same attention block always share precision. ModelBuilder's GQA fusion requires this; without it, partial overrides silently break export on Qwen-style models.
- AUTO `kld_memory_mode` (`olive/passes/pytorch/selective_mixed_precision.py`): A new `auto` setting selects among `full`, `multi_gpu`, `low_memory`, and `offload` based on visible GPU memory and estimated model footprint, and logs the decision (e.g. `KLD memory mode auto-selected: multi_gpu (gpus=3, full=145.14GB, multi_budget=215.86GB, ...)`).
- New `multi_gpu` mode: Uses `accelerate.dispatch_model` + `infer_auto_device_map` with `_no_split_modules` honored. After `infer_auto_device_map`, every `model.layers.N.*` entry is coalesced to the first device assigned for that layer, and a defensive check falls back to `low_memory` if a decoder layer still spans devices. A diagnostic info log reports the per-device layer counts.

Validation (A100 VM)
… `new_missing_qkv_partners=[]`), same 657 MB output, ~301 vs 309 tok/s.

MMLU 0-shot (HF fp16 vs ort-genai int4, greedy)

14B is essentially lossless; the small-model deltas are inherent to int4 SMP on sub-2B-parameter models, not regressions introduced here.
Checklist before requesting a review

… `test_selective_mixed_precision.py`) … `lintrunner -a`

Release note: `SelectiveMixedPrecision` now supports an `auto` setting for `kld_memory_mode` and a new `multi_gpu` mode that shards the KLD-scored forward across visible GPUs via Accelerate. Quant config overrides are normalized so Q/K/V projections in the same attention block share precision, ensuring compatibility with ModelBuilder GQA fusion.
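The kind of decision the `auto` setting makes can be sketched as a budget comparison. This is a hedged illustration only: the description above states that the choice depends on visible GPU memory and estimated model footprint, but it does not give the thresholds or the criterion separating `low_memory` from `offload`. The function name `select_kld_memory_mode`, the `headroom` factor, and the `working_set_gb` parameter are all assumptions made for this example.

```python
def select_kld_memory_mode(full_gb, gpu_free_gb, working_set_gb, headroom=0.9):
    """Pick a memory mode from estimated footprints (all sizes in GB).

    full_gb:        estimated footprint of scoring with the whole model resident
    gpu_free_gb:    free memory per visible GPU
    working_set_gb: hypothetical footprint of the reduced low-memory path
    headroom:       hypothetical safety factor applied to every budget
    """
    if not gpu_free_gb:
        return "offload"

    single_budget = headroom * max(gpu_free_gb)
    multi_budget = headroom * sum(gpu_free_gb)

    if full_gb <= single_budget:
        return "full"
    if len(gpu_free_gb) > 1 and full_gb <= multi_budget:
        return "multi_gpu"
    if working_set_gb <= single_budget:
        return "low_memory"
    return "offload"


# Three GPUs with 80 GB free each and a 145.14 GB full footprint (hypothetical inputs).
mode = select_kld_memory_mode(full_gb=145.14, gpu_free_gb=[80.0, 80.0, 80.0], working_set_gb=6.0)
```

With these inputs the full footprint exceeds any single GPU's budget but fits in the combined budget, so the sketch returns `multi_gpu`; the ordering of the checks prefers the simplest mode that fits.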