Add Automated QDQ placement example - Part 4.1 #841
Open: willg-nv wants to merge 3 commits into NVIDIA:main from willg-nv:dev-willg-integrate-auto-qdq-placement-part4.1
**File: `README.md`** (new file, +256 lines)
# QDQ Placement Optimization Example

This example demonstrates automated Q/DQ (Quantize/Dequantize) node placement optimization for ONNX models using TensorRT performance measurements.

## Prerequisites

### Get the Model

Download the ResNet50 model from the ONNX Model Zoo:

```bash
# Download ResNet50 from ONNX Model Zoo
curl -L -o resnet50_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet50_Opset17_torch_hub/resnet50_Opset17.onnx
```

### Set Fixed Batch Size (Recommended)

The downloaded model has a dynamic batch size. For more consistent TensorRT benchmarking results, set a fixed batch size:

```bash
# Set batch size to 128 using the provided script
python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx

# Or for other batch sizes
python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 1 --output resnet50.bs1.onnx
```

This creates `resnet50.bs128.onnx` with a fixed batch size of 128, which is well suited to TensorRT performance benchmarking.

**Note:** The script requires the `onnx` package. If you have modelopt installed, this dependency should already be available.
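To sanity-check the converted model, a minimal sketch like the following prints the graph input shapes. It only needs the `onnx` package and assumes the `resnet50.bs128.onnx` output from the command above:

```python
import onnx

# Load the fixed-batch-size model produced by set_batch_size.py
model = onnx.load("resnet50.bs128.onnx")

# Print every graph input's shape; the leading dimension should now be 128
for inp in model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)
```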
### What's in This Directory

- `set_batch_size.py` - Script to convert dynamic batch size models to fixed batch size
- `README.md` - This guide

**Note:** ONNX model files are not included in the repository (excluded via `.gitignore`). Download and prepare them using the instructions above.
## Quick Start

### Basic Usage

Optimize the ResNet50 model with INT8 quantization:

```bash
# Using the fixed batch size model (recommended)
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet50.bs128.onnx \
    --output ./resnet50_results \
    --quant-type int8 \
    --schemes-per-region 30

# Or use the original dynamic batch size model
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet50_Opset17.onnx \
    --output ./resnet50_results \
    --quant-type int8 \
    --schemes-per-region 30
```

This will:

1. Automatically discover optimization regions in your model
2. Test 30 different Q/DQ placement schemes per region pattern
3. Measure TensorRT performance for each scheme
4. Export the best optimized model to `./resnet50_results/optimized_final.onnx`
### FP8 Quantization

For FP8 quantization (typically faster on GPUs with native FP8 support):

```bash
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet50.bs128.onnx \
    --output ./resnet50_fp8_results \
    --quant-type fp8 \
    --schemes-per-region 50
```

### Faster Exploration

For quick experiments, reduce the number of schemes:

```bash
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet50.bs128.onnx \
    --output ./resnet50_quick \
    --schemes-per-region 15
```
## Output Structure

After running, you'll get:

```log
resnet50_results/
├── optimized_final.onnx                  # Your optimized model
├── baseline.onnx                         # Baseline for comparison
├── autotuner_state.yaml                  # Resume checkpoint
├── autotuner_state_pattern_cache.yaml    # Reusable patterns
└── logs/
    ├── baseline.log                      # TensorRT baseline log
    ├── region_*_scheme_*.log             # Per-scheme logs
    └── final.log                         # Final model log
```
## Using the Optimized Model

Deploy with TensorRT:

```bash
trtexec --onnx=resnet50_results/optimized_final.onnx \
    --saveEngine=resnet50.engine \
    --stronglyTyped
```
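If you prefer to consume the engine from Python rather than `trtexec`, a rough sketch using the standard TensorRT Python API looks like the following. The engine path matches the `trtexec` command above; buffer allocation and the actual inference loop are intentionally omitted:

```python
import tensorrt as trt

# Deserialize the engine built by trtexec
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("resnet50.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Create an execution context; binding device buffers and running inference
# (e.g. with cuda-python or polygraphy helpers) is left out of this sketch.
context = engine.create_execution_context()
print("Engine I/O tensors:", [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)])
```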
## Pattern Cache

Reuse learned patterns on similar models:

```bash
# First optimization on ResNet50
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet50.bs128.onnx \
    --output ./resnet50_run

# Download and prepare ResNet101 (or any similar model)
curl -L -o resnet101_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet101_Opset17_torch_hub/resnet101_Opset17.onnx
python3 set_batch_size.py resnet101_Opset17.onnx --batch-size 128 --output resnet101.bs128.onnx

# Reuse patterns from ResNet50 on ResNet101 (much faster!)
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet101.bs128.onnx \
    --output ./resnet101_run \
    --pattern-cache ./resnet50_run/autotuner_state_pattern_cache.yaml
```
## Optimize from Existing QDQ Model

If you already have a quantized model (e.g., from manual quantization or another tool), you can use it as a starting point to potentially find even better Q/DQ placements (a quick sketch for inspecting the baseline's existing Q/DQ nodes follows the list below):

```bash
# Use an existing QDQ model as baseline
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet50.bs128.onnx \
    --output ./resnet50_improved \
    --qdq-baseline resnet50_quantized.onnx \
    --schemes-per-region 40
```

This will:

1. Extract Q/DQ insertion points from the baseline model
2. Use them as seed schemes during optimization
3. Generate and test variations to find better placements
4. Compare against the baseline performance
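As a quick sanity check before autotuning, a small sketch like this counts the Q/DQ nodes already present in the baseline model (the file name matches the `--qdq-baseline` argument above):

```python
from collections import Counter

import onnx

# Count QuantizeLinear / DequantizeLinear nodes in the existing QDQ baseline
model = onnx.load("resnet50_quantized.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)
print("QuantizeLinear:", op_counts["QuantizeLinear"])
print("DequantizeLinear:", op_counts["DequantizeLinear"])
```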
**Use cases:**

- **Improve existing quantization**: Fine-tune manually quantized models
- **Compare tools**: Test whether the autotuner can beat other quantization methods
- **Bootstrap optimization**: Start from expert-tuned schemes

**Example workflow:**

```bash
# Step 1: Create an initial quantized model with any quantization tool
# For example, using modelopt's quantize function:
python3 -c "
import numpy as np
from modelopt.onnx.quantization import quantize

# Create dummy calibration data (replace with real data for production)
dummy_input = np.random.randn(128, 3, 224, 224).astype(np.float32)
quantize(
    'resnet50.bs128.onnx',
    calibration_data=dummy_input,
    calibration_method='entropy',
    output_path='resnet50_quantized.onnx'
)
"

# Step 2: Use the quantized baseline for autotuning
# The autotuner will try to find better Q/DQ placements than the initial quantization
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet50.bs128.onnx \
    --output ./resnet50_autotuned \
    --qdq-baseline resnet50_quantized.onnx \
    --schemes-per-region 50
```

**Note:** This example uses dummy calibration data. For production use, provide real calibration data representative of your inference workload.
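For reference, one hedged way to build real calibration data is to preprocess a folder of representative images into a single NumPy array and pass that array as `calibration_data` instead of the dummy input. The sketch below assumes ImageNet-style normalization and that `Pillow` is installed; the `calibration_images/` folder name is a placeholder:

```python
from pathlib import Path

import numpy as np
from PIL import Image

# ImageNet mean/std normalization (a common choice for ResNet50; adjust to
# match whatever preprocessing your deployment pipeline actually uses)
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

samples = []
for path in sorted(Path("calibration_images").glob("*.jpg"))[:128]:
    img = Image.open(path).convert("RGB").resize((224, 224))
    arr = np.asarray(img, dtype=np.float32) / 255.0   # HWC, values in [0, 1]
    arr = (arr - mean) / std                          # normalize per channel
    samples.append(arr.transpose(2, 0, 1))            # HWC -> CHW

calib_data = np.stack(samples).astype(np.float32)     # shape (N, 3, 224, 224)
np.save("calibration_data.npy", calib_data)
print("Calibration batch shape:", calib_data.shape)
```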
## Remote Autotuning with TensorRT

TensorRT 10.16+ supports remote autotuning, which allows you to offload TensorRT's optimization process to remote hardware. This is useful when you want to optimize models for different target GPUs without having direct access to them.

To use remote autotuning during Q/DQ placement optimization:

```bash
python3 -m modelopt.onnx.quantization.autotune \
    --model resnet50.bs128.onnx \
    --output ./resnet50_remote_autotuned \
    --schemes-per-region 50 \
    --use_trtexec \
    --trtexec_benchmark_args "--remoteAutoTuningConfig=\"<remote autotuning config>\""
```

**Requirements:**

- TensorRT 10.16 or later
- Valid remote autotuning configuration
- `--use_trtexec` flag must be enabled

Replace `<remote autotuning config>` with your actual remote autotuning configuration string provided by your TensorRT setup.
## Programmatic API Usage

All examples above use the command-line interface. For **low-level programmatic control** in your Python code, use the Python API directly. This allows you to:

- Integrate autotuning into custom pipelines (a simple CLI-wrapper sketch is shown at the end of this section)
- Implement custom evaluation functions
- Control state management and checkpointing
- Build custom optimization workflows

**See the API Reference documentation for low-level usage:**

- [`docs/source/reference/2_qdq_placement.rst`](../../docs/source/reference/2_qdq_placement.rst)

The API docs include detailed examples of:

- Using the `Autotuner` class directly
- Customizing region discovery and scheme generation
- Managing optimization state programmatically
- Implementing custom performance evaluators
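If you only need to embed autotuning in a larger script, and do not need the low-level classes, one simple option is to wrap the same CLI shown throughout this README. The sketch below is just such a thin wrapper, not the `Autotuner` API itself (see the API reference above for that); the flags and the `optimized_final.onnx` output name match the examples earlier in this guide:

```python
import subprocess
import sys
from pathlib import Path


def run_autotune(model: str, output_dir: str, quant_type: str = "int8", schemes: int = 30) -> Path:
    """Invoke the autotune CLI and return the path of the optimized model."""
    cmd = [
        sys.executable, "-m", "modelopt.onnx.quantization.autotune",
        "--model", model,
        "--output", output_dir,
        "--quant-type", quant_type,
        "--schemes-per-region", str(schemes),
    ]
    subprocess.run(cmd, check=True)
    return Path(output_dir) / "optimized_final.onnx"


if __name__ == "__main__":
    optimized = run_autotune("resnet50.bs128.onnx", "./resnet50_results")
    print("Optimized model written to", optimized)
```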
## Documentation

For comprehensive documentation on QDQ placement optimization, see:

- **User Guide**: [`docs/source/guides/9_qdq_placement.rst`](../../docs/source/guides/9_qdq_placement.rst)
  - Detailed explanations of how the autotuner works
  - Advanced usage patterns and best practices
  - Configuration options and performance tuning
  - Troubleshooting common issues

- **API Reference**: [`docs/source/reference/2_qdq_placement.rst`](../../docs/source/reference/2_qdq_placement.rst)
  - Complete API documentation for all classes and functions
  - Low-level usage examples
  - State management and pattern cache details

For command-line help:

```bash
python3 -m modelopt.onnx.quantization.autotune --help
```
**File: `set_batch_size.py`** (new file, +121 lines)
```python
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Script to set a fixed batch size for ONNX models.

This script modifies an ONNX model with dynamic batch size to use a fixed batch size,
which is often beneficial for TensorRT performance benchmarking.

Usage:
    python set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
"""

import argparse

import onnx
from onnx import shape_inference


def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
    """
    Set a fixed batch size for an ONNX model.

    Assumes the batch dimension is the first dimension of the model's first
    graph input and of every graph output.

    Args:
        model_path: Path to input ONNX model
        batch_size: Desired batch size
        output_path: Path to save modified model
    """
    # Load the model
    print(f"Loading model from {model_path}...")
    model = onnx.load(model_path)

    # Get the input tensor
    graph = model.graph
    input_tensor = graph.input[0]

    print(
        f"Original input shape: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}"
    )

    # Modify the batch dimension (first dimension)
    if len(input_tensor.type.tensor_type.shape.dim) > 0:
        input_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
        # Clear any symbolic dimension parameter
        input_tensor.type.tensor_type.shape.dim[0].ClearField("dim_param")

    # Also update output shapes if needed
    for output_tensor in graph.output:
        if len(output_tensor.type.tensor_type.shape.dim) > 0:
            output_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
            output_tensor.type.tensor_type.shape.dim[0].ClearField("dim_param")

    print(
        f"Modified input shape: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}"
    )

    # Run shape inference to propagate the batch size through the model
    print("Running shape inference...")
    try:
        model = shape_inference.infer_shapes(model)
    except Exception as e:
        print(f"Warning: Shape inference failed: {e}")
        print("Continuing without shape inference...")

    # Save the modified model
    print(f"Saving modified model to {output_path}...")
    onnx.save(model, output_path)

    # Verify the saved model
    print("Verifying model...")
    onnx.checker.check_model(output_path)
    print("✓ Model saved and verified successfully!")


def main():
    parser = argparse.ArgumentParser(
        description="Set a fixed batch size for an ONNX model",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    # Set batch size to 128 for ResNet50
    python set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx

    # Set batch size to 1 for single-image inference
    python set_batch_size.py resnet50_Opset17.onnx --batch-size 1 --output resnet50.bs1.onnx
""",
    )

    parser.add_argument("model", help="Path to input ONNX model")
    parser.add_argument(
        "--batch-size", "-b", type=int, default=128, help="Batch size to set (default: 128)"
    )
    parser.add_argument(
        "--output", "-o", help="Path to save modified model (default: <model>.bs<batch_size>.onnx)"
    )

    args = parser.parse_args()

    # Generate output path if not provided
    if args.output is None:
        base_name = args.model.rsplit(".", 1)[0]
        args.output = f"{base_name}.bs{args.batch_size}.onnx"

    set_batch_size(args.model, args.batch_size, args.output)


if __name__ == "__main__":
    main()
```
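The conversion can also be driven from Python by importing the function above; this assumes the interpreter is started from this example directory so the module is importable:

```python
from set_batch_size import set_batch_size

# Equivalent to the CLI call shown in the README
set_batch_size("resnet50_Opset17.onnx", 128, "resnet50.bs128.onnx")
```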