[AMD] Add RyzenAI passes for latest flows; Restore legacy VitisAI passes for eager flow for backward compatibility #2481
Open
poganesh wants to merge 4 commits into
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds a Quark + VitisAI quantization workflow (Torch + ONNX) into Olive, including model/data prep utilities, quantization configuration helpers, and a DBRX MoE expert-module replacement to support quantization/export.
Changes:
- Add `QuarkQuantizationVitisAI` pass supporting Quark-ONNX and Quark-Torch flows and register it in `olive_config.json`.
- Introduce Torch LLM PTQ utilities (model/data/config preparation, quantize runner) and ONNX config/runner utilities.
- Add DBRX expert module replacement for MoE quantization, and update Vitis LLM model generation pass API and config.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| olive/passes/quark_vitisai/torch/language_modeling/module_replacement/dbrx_expert.py | Adds DBRX MoE expert module replacement used during quantization. |
| `olive/passes/quark_vitisai/torch/language_modeling/module_replacement/__init__.py` | Package init for module replacement utilities. |
| olive/passes/quark_vitisai/torch/language_modeling/llm_utils/model_preparation.py | Adds tokenizer/model loading utilities and MoE preparation hook for DBRX. |
| olive/passes/quark_vitisai/torch/language_modeling/llm_utils/data_preparation.py | Adds calibration/training dataset utilities for PTQ. |
| `olive/passes/quark_vitisai/torch/language_modeling/llm_utils/__init__.py` | Package init for LLM utilities. |
| olive/passes/quark_vitisai/torch/language_modeling/llm_ptq/quantize_quark.py | Adds the Torch-side Quark quantization driver. |
| olive/passes/quark_vitisai/torch/language_modeling/llm_ptq/customized_configuration.py | Adds supported quant schemes and spec/config factories. |
| olive/passes/quark_vitisai/torch/language_modeling/llm_ptq/configuration_preparation.py | Builds Quark Torch quant/export configuration from CLI args/model-type. |
| `olive/passes/quark_vitisai/torch/language_modeling/llm_ptq/__init__.py` | Package init for PTQ module. |
| `olive/passes/quark_vitisai/torch/language_modeling/__init__.py` | Package init for torch language modeling module. |
| `olive/passes/quark_vitisai/torch/__init__.py` | Package init for quark_vitisai torch integration. |
| olive/passes/quark_vitisai/quark_quantization_vitisai.py | Implements Olive pass that routes to Quark-ONNX or Quark-Torch quantization. |
| olive/passes/quark_vitisai/onnx/quantize_quark.py | Adds ONNX-side Quark quantization runner wrapper. |
| olive/passes/quark_vitisai/onnx/configuration_preparation.py | Adds helpers to build ONNX QConfig (global/algo/extra options). |
| `olive/passes/quark_vitisai/onnx/__init__.py` | Package init for ONNX integration. |
| `olive/passes/quark_vitisai/__init__.py` | Package init for quark_vitisai pass package. |
| olive/passes/onnx/vitis_ai/vitis_generate_model_llm.py | Changes Vitis LLM model generation pass parameters/API and hardcodes output ONNX name. |
| olive/passes/onnx/ryzen_ai/ryzen_generate_model_llm.py | Adds Ryzen LLM model generation pass (existing full-feature config). |
| `olive/passes/onnx/ryzen_ai/__init__.py` | Package init for ryzen_ai passes. |
| olive/olive_config.json | Registers new Quark/VitisAI passes in Olive’s pass registry. |
Comment on lines +164 to +169

```python
q_layers_name = MODEL_NAME_Q_LAYERS_MAP[model_type]
layer_quant_config[q_layers_name] = QuantizationConfig(
    input_tensors=global_quant_config.input_tensors,
    weight=global_quant_config.weight,
    output_tensors=attn_qspec,
)
```
Comment on lines +251 to +273

```python
def get_device_max_memory() -> dict[Union[int, str], Union[int, str]]:
    for i in range(torch.cuda.device_count()):
        _ = torch.tensor([0], device=i)
    cuda_avail_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
    cpu_avail_memory = psutil.virtual_memory().available
    max_memory = {}
    for cuda_num, cuda_memory in cuda_avail_memory.items():
        cuda_memory_gb = cuda_memory / (10**9)
        logger.info("GPU%s cuda_avail_memory: %.1fGB", cuda_num, cuda_memory_gb)
        if cuda_num == 0:
            # The ratio is an experience value that you can manually adjust yourself.
            gpu0_ratio = 0.5 if cuda_memory_gb > 30 else 0.3
            max_memory[cuda_num] = f"{cuda_memory_gb * gpu0_ratio:.1f}GB"
        else:
            other_ratio = 0.875 if cuda_memory_gb > 30 else 0.7
            max_memory[cuda_num] = f"{cuda_memory_gb * other_ratio:.1f}GB"
    logger.info("cpu_avail_memory: %.1fGB", cpu_avail_memory / (10**9))
    cpu_ratio = 0.875
    max_memory["cpu"] = f"{cpu_avail_memory / (10**9) * cpu_ratio:.1f}GB"
    logger.info("final_use_model_kwargs: %s", max_memory)
    # max_memory = {0: '0.1GB', 'cpu': '100GB'}

    return max_memory
```
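The ratio logic in the quoted function can be separated from the CUDA and psutil calls. The sketch below is a hypothetical pure-Python restatement, not code from the PR: the name `plan_max_memory` and its two arguments are assumptions, and the free-memory figures are passed in so the thresholds can be checked without a GPU.

```python
def plan_max_memory(cuda_avail_bytes: dict, cpu_avail_bytes: int) -> dict:
    """Hypothetical helper: apply the quoted ratios to already-measured free memory."""
    max_memory = {}
    for cuda_num, free_bytes in cuda_avail_bytes.items():
        free_gb = free_bytes / 10**9
        is_large = free_gb > 30  # same 30 GB threshold as the quoted code
        if cuda_num == 0:
            ratio = 0.5 if is_large else 0.3  # GPU 0 gets the smaller share
        else:
            ratio = 0.875 if is_large else 0.7
        max_memory[cuda_num] = f"{free_gb * ratio:.1f}GB"
    # CPU memory always uses the fixed 0.875 ratio from the quoted code.
    max_memory["cpu"] = f"{cpu_avail_bytes / 10**9 * 0.875:.1f}GB"
    return max_memory


# Made-up figures: 40 GB free on GPU 0, 16 GB on GPU 1, 64 GB of host memory.
plan = plan_max_memory({0: 40 * 10**9, 1: 16 * 10**9}, 64 * 10**9)
print(plan)
```

Taking the measurements as inputs keeps the policy testable on machines without CUDA; the original function would still be the place that calls `torch.cuda.mem_get_info` and `psutil.virtual_memory`.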
Comment on lines +124 to +127

```python
if tokenizer.pad_token != "<unk>":
    tokenizer.pad_token = tokenizer.eos_token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```
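To make the behaviour of the two quoted conditions concrete, the sketch below runs them against a stand-in tokenizer object. Everything except the two `if` statements is an assumption for illustration: `apply_pad_token_logic`, `make_tokenizer`, and the `"</s>"` EOS value are hypothetical, and a `SimpleNamespace` replaces the real tokenizer class.

```python
from types import SimpleNamespace


def apply_pad_token_logic(tokenizer):
    """Hypothetical wrapper around the two quoted conditions, unchanged."""
    if tokenizer.pad_token != "<unk>":
        tokenizer.pad_token = tokenizer.eos_token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer.pad_token


def make_tokenizer(pad_token):
    # Stand-in object exposing only the two attributes the snippet touches.
    return SimpleNamespace(pad_token=pad_token, eos_token="</s>")


results = {
    pad: apply_pad_token_logic(make_tokenizer(pad))
    for pad in ("<pad>", "<unk>", None)
}
print(results)
```

With these inputs, any pad token other than `"<unk>"` (including a missing one) ends up replaced by the EOS token in the first branch, so the second branch has nothing left to change.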
Comment on lines +20 to +22

```python
def get_pileval(
    tokenizer: PreTrainedTokenizer, nsamples: int, seqlen: int, device: str | None, seed: int = 0
) -> DataLoader[torch.Tensor]:
```
Comment on lines +146 to +151

```python
def my_collate_fn(blocks: list[dict[str, list[list[str]]]]) -> dict[str, torch.Tensor]:
    data_batch = {}
    data_batch["input_ids"] = torch.Tensor([block["input_ids"] for block in blocks])
    if device:
        data_batch["input_ids"] = data_batch["input_ids"].to(device)
    return data_batch
```
Comment on lines +43 to +45

```python
experts_module.mlp.w1 = None
experts_module.mlp.v1 = None
experts_module.mlp.w2 = None
```
Comment on lines 51 to 54

```diff
 return ONNXModelHandler(
     model_path=output_dir,
-    onnx_file_name=output_model_name,
+    onnx_file_name="model.onnx",
 )
```
Comment on lines +129 to +130

```python
new_tmp_dir = tempfile.TemporaryDirectory(prefix="olive_tmp")  # pylint: disable=R1732
tmp_model_path = str(Path(new_tmp_dir.name) / Path(output_model_path).name)
```
Comment on lines +156 to +158

```python
onnx_model = onnx.load(tmp_model_path)
# the model is loaded into memory, so it's safe to delete previously exported files
new_tmp_dir.cleanup()
```
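The two quoted snippets create a temporary directory by hand and clean it up after loading. One alternative, shown here only as a sketch and not as the PR's approach, is the context-manager form of `tempfile.TemporaryDirectory`, which removes the directory even if an exception is raised in between. The function name `load_via_temp_dir` is hypothetical, and plain byte reads and writes stand in for the export step and for `onnx.load`.

```python
import tempfile
from pathlib import Path


def load_via_temp_dir(file_name: str, payload: bytes):
    """Hypothetical helper: write, read back, and clean up in one scope."""
    with tempfile.TemporaryDirectory(prefix="olive_tmp") as tmp_dir:
        tmp_path = Path(tmp_dir) / file_name
        tmp_path.write_bytes(payload)  # stand-in for the export step
        data = tmp_path.read_bytes()   # stand-in for onnx.load(tmp_model_path)
    # The directory is already gone here; only the in-memory data survives.
    return data, Path(tmp_dir)


data, used_dir = load_via_temp_dir("model.onnx", b"fake-model")
print(len(data), used_dir.exists())
```

This form assumes everything needed from the temporary files has been read into memory before the `with` block exits, which is the same assumption the quoted comment makes.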
Contributor
Author
The Copilot comments concern code that already existed and has only been refactored into separate folders.
We therefore recommend addressing them in a separate PR, based on the older Quark version, if required.
Describe your changes
Adds the `RyzenGenerateModelLLM` Olive pass for the new RyzenAI full-fusion / token-fusion flow on AMD NPU/hybrid devices, and restores two legacy passes to support the older VitisAI eager flow in parallel:

- `VitisGenerateModelLLM` (from commit a24d73a) - legacy eager model generation
- `QuarkQuantizationVitisAI` (from commit 1615bda, originally `QuarkQuantization`) - legacy Quark 0.9 quantization

The legacy quantization class was renamed to avoid a name collision with the existing `QuarkQuantization` pass (Quark 0.11+) used by the new fusion flow. Legacy module path: `olive.passes.quark_vitisai`.

Pass mapping

| Flow | Model generation pass | Quantization pass |
|---|---|---|
| RyzenAI fusion flow (new) | `RyzenGenerateModelLLM` | `QuarkQuantization` |
| VitisAI eager flow (legacy) | `VitisGenerateModelLLM` | `QuarkQuantizationVitisAI` |

Companion olive-recipes PR

microsoft/olive-recipes#PR-440: uses these passes per model in the `RyzenAI/` and `VitisAI/` subfolders.
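The pass mapping described above can be sketched as a small lookup that produces a `passes` section for a workflow config. This is an illustration under stated assumptions: the flow keys (`ryzenai_fusion`, `vitisai_eager`), the pass instance names, the helper `build_passes`, and the quantize-then-generate ordering are hypothetical; only the four pass type names come from the PR description. Real recipes would also need pass-specific options that are not shown here.

```python
# Pass type names are taken from the PR description; all keys are hypothetical.
PASS_MAPPING = {
    "ryzenai_fusion": {
        "quantization": "QuarkQuantization",
        "model_generation": "RyzenGenerateModelLLM",
    },
    "vitisai_eager": {
        "quantization": "QuarkQuantizationVitisAI",
        "model_generation": "VitisGenerateModelLLM",
    },
}


def build_passes(flow: str) -> dict:
    """Return a `passes` mapping of instance name to {"type": pass class name}."""
    if flow not in PASS_MAPPING:
        raise ValueError(f"unknown flow {flow!r}; expected one of {sorted(PASS_MAPPING)}")
    selected = PASS_MAPPING[flow]
    # Assumed ordering: quantize first, then generate the device-specific model.
    return {
        "quantize": {"type": selected["quantization"]},
        "generate": {"type": selected["model_generation"]},
    }


legacy = build_passes("vitisai_eager")
fusion = build_passes("ryzenai_fusion")
print(legacy)
print(fusion)
```

Keeping the two flows in one table makes the rename visible at a glance: the legacy flow never refers to `QuarkQuantization`, so the two quantization passes cannot be confused by name.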