refactor(examples): consolidate vlm_ptq into llm_ptq #1705
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -136,6 +136,29 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http | |
|
|
||
| > You can also create your own custom config using [this](https://nvidia.github.io/Model-Optimizer/guides/_pytorch_quantization.html#custom-calibration-algorithm) guide. | ||
|
|
||
| ### Vision Language Model (VLM) Supported Models | ||
|
|
||
| PTQ for vision-language models is handled by the same `hf_ptq.py` entry point and shell script as | ||
| LLMs — the language model is quantized while the vision encoder is kept in high precision. Pass | ||
| `--vlm` to the shell script (see [VLM quantization](#vlm-quantization)). | ||
|
|
||
| | Model | fp8 | int8_sq<sup>1</sup> | int4_awq | w4a8_awq<sup>2</sup> | nvfp4<sup>3</sup> | | ||
|
Contributor
Can we merge this list into https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#hugging-face-supported-models?
||
| | :---: | :---: | :---: | :---: | :---: | :---: | | ||
| | Llava | ✅ | ✅ | ✅ | ✅ | - | | ||
| | VILA<sup>4</sup> | ✅ | ✅ | ✅ | ✅ | - | | ||
| | Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ | | ||
| | Qwen2, 2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ | | ||
| | Gemma3 | ✅ | - | - | - | - | | ||
| | Nemotron VL<sup>5</sup> | ✅ | - | - | - | ✅ | | ||
|
|
||
| > *<sup>1.</sup>Only TensorRT-LLM checkpoint export is supported. Not compatible with the TensorRT-LLM torch backend.* \ | ||
| > *<sup>2.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \ | ||
| > *<sup>3.</sup>A selective set of the popular models are internally tested. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.* \ | ||
| > *<sup>4.</sup>VILA requires `transformers<=4.50.0` and the original VILA repo; the shell script bootstraps both (see [`requirements-vila.txt`](./requirements-vila.txt)).* \ | ||
| > *<sup>5.</sup>Nemotron VL automatically calibrates with image-text pairs; see [VLM calibration with image-text pairs](#vlm-calibration-with-image-text-pairs-eg-nemotron-vl).* | ||
|
|
||
| > *For detailed TensorRT-LLM torch backend multimodal support, please refer to [this doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md#multimodal-feature-support-matrix-pytorch-backend).* | ||
|
|
||
| ## Framework Scripts | ||
|
|
||
| ### Hugging Face Example [Script](./scripts/huggingface_example.sh) | ||
|
|
@@ -243,6 +266,23 @@ The cast pins each NVFP4 block's `scale_2 = 2^(k_max - 8)` and `_amax = 6 * 2^k_ | |
|
|
||
| [PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM. | ||
|
|
||
| #### VLM quantization | ||
|
|
||
| Vision-language models are quantized through the same script. Add `--vlm` so the script bootstraps | ||
| any VLM-specific dependencies (e.g. VILA) and runs the TensorRT-LLM multimodal quickstart as the | ||
| deploy smoke test instead of the text-only one: | ||
|
|
||
| ```bash | ||
| scripts/huggingface_example.sh --model <Hugging Face model card or checkpoint> --quant fp8 --vlm | ||
| ``` | ||
|
|
||
| Supported `--quant` values for VLMs are `fp8`, `nvfp4`, `int8_sq`, `int4_awq`, and `w4a8_awq` (see | ||
| [VLM Supported Models](#vision-language-model-vlm-supported-models)). For VILA models the script | ||
| additionally installs [`requirements-vila.txt`](./requirements-vila.txt) and clones the VILA repo | ||
| next to the checkpoint. | ||
|
|
||
| > *This consolidates the former `examples/vlm_ptq` example, which now forwards here.* | ||
|
|
||
| #### VLM calibration with image-text pairs (e.g., Nemotron VL) | ||
|
|
||
| For vision-language models, calibration quality can likely improve by using image-text pairs instead of text-only data, especially on visual understanding tasks: | ||
|
|
@@ -257,6 +297,12 @@ python hf_ptq.py \ | |
| --calib_size 512 | ||
| ``` | ||
|
|
||
| The same flag is exposed by the shell script: | ||
|
|
||
| ```bash | ||
| scripts/huggingface_example.sh --model <model> --quant nvfp4 --vlm --calib_with_images --trust_remote_code | ||
| ``` | ||
|
|
||
| > Note: when `--calib_with_images` is set, `--calib_size` must be a single value, and the calibration dataset is nvidia/nemotron_vlm_dataset_v2. | ||
| This functionality is currently in beta and has been tested on `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16`. | ||
|
|
||
|
|
||
|
Collaborator
Do we want to drop VILA model support? ModelOpt's minimum transformers version is 4.56, so we cannot continue guaranteeing that it works with 4.50.
Collaborator
+1
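To make the version conflict raised above concrete, here is a minimal sketch of the ceiling check written as a reusable helper. The function name `version_le` and the sample version list are illustrative assumptions, not part of the PR; like the PR's script, it relies on `sort -V` (GNU coreutils version sort).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): succeeds when version $1 <= version $2.
# Assumes a `sort` that supports -V (version sort), e.g. GNU coreutils.
version_le() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

MAX_SUPPORTED="4.50.0"   # ceiling stated in requirements-vila.txt

# Sample versions chosen for illustration only.
for installed in 4.49.2 4.50.0 4.56.0; do
    if version_le "$installed" "$MAX_SUPPORTED"; then
        echo "$installed: ok for VILA"
    else
        echo "$installed: too new for VILA"
    fi
done
```

Under these assumptions, 4.56.0 (the ModelOpt minimum mentioned above) is reported as too new, which is the incompatibility the reviewers are pointing at.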
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| # Extra dependencies for quantizing NVILA / VILA vision-language models. | ||
| # | ||
| # VILA is not yet published on the Hugging Face model zoo, so its modeling code must be | ||
| # cloned separately (handled automatically by scripts/huggingface_example.sh --vlm) from | ||
| # https://github.com/Efficient-Large-Model/VILA.git at commit | ||
| # ec7fb2c264920bf004fd9fa37f1ec36ea0942db5. | ||
| # | ||
| # VILA's modeling code is only compatible with transformers<=4.50.0. | ||
| transformers<=4.50.0 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -84,8 +84,36 @@ if [ "${REMOVE_EXISTING_MODEL_CONFIG,,}" = "true" ]; then | |
| rm -f $MODEL_CONFIG | ||
| fi | ||
|
|
||
| # VILA vision-language models are not yet on the HF model zoo and require the original | ||
| # VILA repo plus an older transformers. Only triggered for VLM runs (--vlm) on VILA models. | ||
| if $VLM && [[ "${MODEL_NAME,,}" == *"vila"* ]]; then | ||
| # Check transformers version - must be <= 4.50.0 | ||
| CURRENT_TRANSFORMERS_VERSION=$(pip show transformers | grep Version | cut -d' ' -f2) | ||
| if [ "$(printf '%s\n' "4.50.0" "$CURRENT_TRANSFORMERS_VERSION" | sort -V | head -n1)" = "4.50.0" ] && [ "$CURRENT_TRANSFORMERS_VERSION" != "4.50.0" ]; then | ||
| echo "ERROR: transformers version $CURRENT_TRANSFORMERS_VERSION is not supported." >&2 | ||
| echo "VILA requires transformers<=4.50.0" >&2 | ||
| echo "Please refer to examples/llm_ptq/requirements-vila.txt for the supported versions." >&2 | ||
| echo "You also need to download VILA repository from https://github.com/Efficient-Large-Model/VILA.git and checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5" >&2 | ||
| exit 1 | ||
| fi | ||
|
|
||
| pip install -r requirements-vila.txt | ||
| # Clone original VILA repo | ||
| if [ ! -d "$(dirname "$MODEL_PATH")/VILA" ]; then | ||
| echo "VILA repository is needed until it is added to HF model zoo. Cloning the repository parallel to $MODEL_PATH..." | ||
| git clone https://github.com/Efficient-Large-Model/VILA.git "$(dirname "$MODEL_PATH")/VILA" && \ | ||
| cd "$(dirname "$MODEL_PATH")/VILA" && \ | ||
| git checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5 && \ | ||
| cd "$script_dir/.." | ||
| fi | ||
| fi | ||
|
|
||
| PTQ_ARGS="" | ||
|
|
||
| if $CALIB_WITH_IMAGES; then | ||
| PTQ_ARGS+=" --calib_with_images " | ||
| fi | ||
|
|
||
| if [ "$LOW_MEMORY_MODE" = "true" ]; then | ||
| PTQ_ARGS+=" --low_memory_mode " | ||
| fi | ||
|
|
@@ -223,7 +251,21 @@ if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH | |
| # Only run the deploy+generate smoke test when "quant" is explicitly requested. Eval tasks | ||
| # (lm_eval/mmlu/simple_eval) deploy the checkpoint themselves, so it is redundant there. | ||
| if [[ $TASKS =~ "quant" ]]; then | ||
| python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS | ||
| if $VLM; then | ||
| # VLMs use the TRT-LLM multimodal quickstart for the deploy smoke test. | ||
| if [ -z "$TRT_LLM_CODE_PATH" ]; then | ||
| TRT_LLM_CODE_PATH=/app/tensorrt_llm # default path for the TRT-LLM release docker image | ||
| echo "Setting default TRT_LLM_CODE_PATH to $TRT_LLM_CODE_PATH." | ||
| fi | ||
| QUICK_START_MULTIMODAL=$TRT_LLM_CODE_PATH/examples/llm-api/quickstart_multimodal.py | ||
| if [ -f "$QUICK_START_MULTIMODAL" ]; then | ||
| python3 $QUICK_START_MULTIMODAL --model_dir $SAVE_PATH --modality image | ||
| else | ||
|
Comment on lines +262 to +263
Contributor
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Quote smoke-test paths to avoid argument splitting. Line 262 and Line 267 expand path variables unquoted, so a save/code path containing spaces can break the Python command arguments.

Proposed fix:

    - python3 $QUICK_START_MULTIMODAL --model_dir $SAVE_PATH --modality image
    + python3 "$QUICK_START_MULTIMODAL" --model_dir "$SAVE_PATH" --modality image
    ...
    - python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS
    + python run_tensorrt_llm.py --checkpoint_dir="$SAVE_PATH" $RUN_ARGS

Also applies to: 267-267

Tools: Shellcheck (0.11.0), [info] 262-262: Double quote to prevent globbing and word splitting. (SC2086)

Source: Linters/SAST tools
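The splitting behavior described in this comment can be reproduced in isolation. The sketch below is illustrative only: the path value and the `count_args` helper are assumptions made for the demonstration and do not appear in the PR.

```bash
#!/usr/bin/env bash
# Hypothetical save path containing a space (not taken from the PR).
SAVE_PATH="/tmp/saved models/fp8"

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted expansion: the shell splits the path at the space.
unquoted=$(count_args --model_dir $SAVE_PATH)
# Quoted expansion: the path stays a single argument.
quoted=$(count_args --model_dir "$SAVE_PATH")

echo "unquoted: $unquoted args, quoted: $quoted args"
# prints "unquoted: 3 args, quoted: 2 args"
```

With the unquoted form, the Python entry point would receive `/tmp/saved` as the value of `--model_dir` and `models/fp8` as a stray positional argument.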
||
| echo "Warning: $QUICK_START_MULTIMODAL cannot be found. Please set TRT_LLM_CODE_PATH to the TRT-LLM code path or test the quantized checkpoint $SAVE_PATH with the TRT-LLM repo directly." | ||
| fi | ||
| else | ||
| python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS | ||
| fi | ||
| fi | ||
| fi | ||
|
|
||
|
|
||
Do we also want to rename `examples/llm_ptq` to `examples/hf_ptq`?

This is a good idea, though I'm not sure whether it would be a breaking change.

Can we leave a symlink from `examples/llm_ptq/` to the new `examples/hf_ptq/` directory so that the previous path remains valid, and then remove the symlink after a few releases?
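A rough sketch of the symlink idea, run in a throwaway directory so it does not touch a real checkout. The `hf_ptq` directory name comes from the suggestion above and is hypothetical at this point; the placeholder file is an assumption made only for the demonstration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory standing in for the repository root.
workdir=$(mktemp -d)
mkdir -p "$workdir/examples/hf_ptq"          # hypothetical renamed directory
echo "placeholder" > "$workdir/examples/hf_ptq/README.md"

# Relative link: examples/llm_ptq -> hf_ptq, so it keeps working if the repo moves.
ln -s hf_ptq "$workdir/examples/llm_ptq"

# The old path still resolves to the new content.
link_target=$(readlink "$workdir/examples/llm_ptq")
old_path_content=$(cat "$workdir/examples/llm_ptq/README.md")
echo "llm_ptq -> $link_target: $old_path_content"

rm -rf "$workdir"
```

One caveat worth checking before adopting this: symlinks committed to Git may not be materialized as links on Windows checkouts unless symlink support is enabled there, so the old path could still break for some users.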