diff --git a/CHANGELOG.rst b/CHANGELOG.rst index c9e0ba3c684..591ed0c857f 100755 --- a/CHANGELOG.rst +++ b/CHANGELOG.rst @@ -4,6 +4,10 @@ Changelog 0.46 (2026-xx-xx) ^^^^^^^^^^^^^^^^^ +**Deprecations** + +- Consolidated ``examples/vlm_ptq`` into ``examples/llm_ptq``. Vision-language model PTQ now shares the ``hf_ptq.py`` entry point and ``scripts/huggingface_example.sh``; pass ``--vlm`` to bootstrap VLM-specific dependencies (e.g. VILA via ``examples/llm_ptq/requirements-vila.txt``) and run the TensorRT-LLM multimodal quickstart smoke test. The ``examples/vlm_ptq/scripts/huggingface_example.sh`` entry point is deprecated: it now prints a warning and forwards to the ``llm_ptq`` script with ``--vlm``, and will be removed in a future release. See `examples/llm_ptq/README.md `__. + **New Features** - Add the ``day0-release`` agent skill (``.agents/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and baseline-vs-candidate accuracy threshold. v1 reports and stops on regression; the recipe-search loop is deferred. diff --git a/README.md b/README.md index 4852bfa7884..f60ee9ed511 100644 --- a/README.md +++ b/README.md @@ -102,7 +102,7 @@ more fine-grained control on installed dependencies or for alternative docker im | **Technique** | **Description** | **Examples** | **Docs** | | :------------: | :------------: | :------------: | :------------: | -| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! 
| \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/vlm_ptq/)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] | +| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/llm_ptq/README.md#vlm-quantization)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] | | Quantization Aware Training | Refine accuracy even further with a few training steps! | \[[Hugging Face](./examples/llm_qat/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] | | Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | \[[General](./examples/pruning/)\] \[[Megatron-Bridge](./examples/megatron_bridge/README.md#pruning)\] | | | Distillation | Reduce deployment model size by teaching small models to behave like larger models! 
| \[[Megatron-Bridge](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-bridge-framework)\] \[[Megatron-LM](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-lm-framework)\] \[[Hugging Face](./examples/llm_distill/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] | @@ -132,7 +132,7 @@ more fine-grained control on installed dependencies or for alternative docker im |------------|----------------| | LLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#support-matrix) | | Diffusers Quantization | [View Support Matrix](./examples/diffusers/README.md#support-matrix) | -| VLM Quantization | [View Support Matrix](./examples/vlm_ptq/README.md#support-matrix) | +| VLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#vision-language-model-vlm-supported-models) | | ONNX Quantization | [View Support Matrix](./examples/torch_onnx/README.md#onnx-export-supported-llm-models) | | Windows Quantization | [View Support Matrix](./examples/windows/README.md#support-matrix) | | Quantization Aware Training | [View Support Matrix](./examples/llm_qat/README.md#support-matrix) | diff --git a/examples/llm_ptq/README.md b/examples/llm_ptq/README.md index 64ef6deaa01..869ac265f80 100755 --- a/examples/llm_ptq/README.md +++ b/examples/llm_ptq/README.md @@ -136,6 +136,29 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http > You can also create your own custom config using [this](https://nvidia.github.io/Model-Optimizer/guides/_pytorch_quantization.html#custom-calibration-algorithm) guide. +### Vision Language Model (VLM) Supported Models + +PTQ for vision-language models is handled by the same `hf_ptq.py` entry point and shell script as +LLMs — the language model is quantized while the vision encoder is kept in high precision. Pass +`--vlm` to the shell script (see [VLM quantization](#vlm-quantization)). 
+ +| Model | fp8 | int8_sq<sup>1</sup> | int4_awq | w4a8_awq<sup>2</sup> | nvfp4<sup>3</sup> | +| :---: | :---: | :---: | :---: | :---: | :---: | +| Llava | ✅ | ✅ | ✅ | ✅ | - | +| VILA<sup>4</sup> | ✅ | ✅ | ✅ | ✅ | - | +| Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ | +| Qwen2, 2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ | +| Gemma3 | ✅ | - | - | - | - | +| Nemotron VL<sup>5</sup> | ✅ | - | - | - | ✅ | + +> *1.Only TensorRT-LLM checkpoint export is supported. Not compatible with the TensorRT-LLM torch backend.* \ +> *2.The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \ +> *3.A selective set of the popular models are internally tested. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.* \ +> *4.VILA requires `transformers<=4.50.0` and the original VILA repo; the shell script bootstraps both (see [`requirements-vila.txt`](./requirements-vila.txt)).* \ +> *5.Nemotron VL automatically calibrates with image-text pairs; see [VLM calibration with image-text pairs](#vlm-calibration-with-image-text-pairs-eg-nemotron-vl).* + +> *For detailed TensorRT-LLM torch backend multimodal support, please refer to [this doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md#multimodal-feature-support-matrix-pytorch-backend).* + ## Framework Scripts ### Hugging Face Example [Script](./scripts/huggingface_example.sh) @@ -243,6 +266,23 @@ The cast pins each NVFP4 block's `scale_2 = 2^(k_max - 8)` and `_amax = 6 * 2^k_ [PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM. +#### VLM quantization + +Vision-language models are quantized through the same script. Add `--vlm` so the script bootstraps +any VLM-specific dependencies (e.g. 
VILA) and runs the TensorRT-LLM multimodal quickstart as the +deploy smoke test instead of the text-only one: + +```bash +scripts/huggingface_example.sh --model --quant fp8 --vlm +``` + +Supported `--quant` values for VLMs are `fp8`, `nvfp4`, `int8_sq`, `int4_awq`, and `w4a8_awq` (see +[VLM Supported Models](#vision-language-model-vlm-supported-models)). For VILA models the script +additionally installs [`requirements-vila.txt`](./requirements-vila.txt) and clones the VILA repo +next to the checkpoint. + +> *This consolidates the former `examples/vlm_ptq` example, which now forwards here.* + #### VLM calibration with image-text pairs (e.g., Nemotron VL) For vision-language models, calibration quality can likely improve by using image-text pairs instead of text-only data, especially on visual understanding tasks: @@ -257,6 +297,12 @@ python hf_ptq.py \ --calib_size 512 ``` +The same flag is exposed by the shell script: + +```bash +scripts/huggingface_example.sh --model --quant nvfp4 --vlm --calib_with_images --trust_remote_code +``` + > Note: when `--calib_with_images` is set, `--calib_size` must be a single value, and the calibration dataset is nvidia/nemotron_vlm_dataset_v2. This functionality is currently in beta and has been tested on `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16`. diff --git a/examples/llm_ptq/requirements-vila.txt b/examples/llm_ptq/requirements-vila.txt new file mode 100644 index 00000000000..eb93298bc7f --- /dev/null +++ b/examples/llm_ptq/requirements-vila.txt @@ -0,0 +1,9 @@ +# Extra dependencies for quantizing NVILA / VILA vision-language models. +# +# VILA is not yet published on the Hugging Face model zoo, so its modeling code must be +# cloned separately (handled automatically by scripts/huggingface_example.sh --vlm) from +# https://github.com/Efficient-Large-Model/VILA.git at commit +# ec7fb2c264920bf004fd9fa37f1ec36ea0942db5. +# +# VILA's modeling code is only compatible with transformers<=4.50.0. 
+transformers<=4.50.0 diff --git a/examples/llm_ptq/scripts/huggingface_example.sh b/examples/llm_ptq/scripts/huggingface_example.sh index 3f51e5b73f3..eff5b1e106e 100755 --- a/examples/llm_ptq/scripts/huggingface_example.sh +++ b/examples/llm_ptq/scripts/huggingface_example.sh @@ -84,8 +84,36 @@ if [ "${REMOVE_EXISTING_MODEL_CONFIG,,}" = "true" ]; then rm -f $MODEL_CONFIG fi +# VILA vision-language models are not yet on the HF model zoo and require the original +# VILA repo plus an older transformers. Only triggered for VLM runs (--vlm) on VILA models. +if $VLM && [[ "${MODEL_NAME,,}" == *"vila"* ]]; then + # Check transformers version - must be <= 4.50.0 + CURRENT_TRANSFORMERS_VERSION=$(pip show transformers | grep Version | cut -d' ' -f2) + if [ "$(printf '%s\n' "4.50.0" "$CURRENT_TRANSFORMERS_VERSION" | sort -V | head -n1)" = "4.50.0" ] && [ "$CURRENT_TRANSFORMERS_VERSION" != "4.50.0" ]; then + echo "ERROR: transformers version $CURRENT_TRANSFORMERS_VERSION is not supported." >&2 + echo "VILA requires transformers<=4.50.0" >&2 + echo "Please refer to examples/llm_ptq/requirements-vila.txt for the supported versions." >&2 + echo "You also need to download VILA repository from https://github.com/Efficient-Large-Model/VILA.git and checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5" >&2 + exit 1 + fi + + pip install -r requirements-vila.txt + # Clone original VILA repo + if [ ! -d "$(dirname "$MODEL_PATH")/VILA" ]; then + echo "VILA repository is needed until it is added to HF model zoo. Cloning the repository parallel to $MODEL_PATH..." + git clone https://github.com/Efficient-Large-Model/VILA.git "$(dirname "$MODEL_PATH")/VILA" && \ + cd "$(dirname "$MODEL_PATH")/VILA" && \ + git checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5 && \ + cd "$script_dir/.." 
+ fi +fi + PTQ_ARGS="" +if $CALIB_WITH_IMAGES; then + PTQ_ARGS+=" --calib_with_images " +fi + if [ "$LOW_MEMORY_MODE" = "true" ]; then PTQ_ARGS+=" --low_memory_mode " fi @@ -223,7 +251,21 @@ if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH # Only run the deploy+generate smoke test when "quant" is explicitly requested. Eval tasks # (lm_eval/mmlu/simple_eval) deploy the checkpoint themselves, so it is redundant there. if [[ $TASKS =~ "quant" ]]; then - python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS + if $VLM; then + # VLMs use the TRT-LLM multimodal quickstart for the deploy smoke test. + if [ -z "$TRT_LLM_CODE_PATH" ]; then + TRT_LLM_CODE_PATH=/app/tensorrt_llm # default path for the TRT-LLM release docker image + echo "Setting default TRT_LLM_CODE_PATH to $TRT_LLM_CODE_PATH." + fi + QUICK_START_MULTIMODAL=$TRT_LLM_CODE_PATH/examples/llm-api/quickstart_multimodal.py + if [ -f "$QUICK_START_MULTIMODAL" ]; then + python3 $QUICK_START_MULTIMODAL --model_dir $SAVE_PATH --modality image + else + echo "Warning: $QUICK_START_MULTIMODAL cannot be found. Please set TRT_LLM_CODE_PATH to the TRT-LLM code path or test the quantized checkpoint $SAVE_PATH with the TRT-LLM repo directly." 
+ fi + else + python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS + fi fi fi diff --git a/examples/llm_ptq/scripts/parser.sh b/examples/llm_ptq/scripts/parser.sh index 3efed91bc32..06b440e5731 100644 --- a/examples/llm_ptq/scripts/parser.sh +++ b/examples/llm_ptq/scripts/parser.sh @@ -37,9 +37,11 @@ parse_options() { VERBOSE=true USE_SEQ_DEVICE_MAP=false CAST_MXFP4_TO_NVFP4=false + VLM=false + CALIB_WITH_IMAGES=false # Parse command-line options - ARGS=$(getopt -o "" -l "model:,quant:,recipe:,kv_cache_quant:,tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,simple_eval_limit:,mmlu_limit:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:,calib_seq:,auto_quantize_method:,auto_quantize_score_size:,auto_quantize_checkpoint:,moe_calib_experts_ratio:,cast_mxfp4_to_nvfp4" -n "$0" -- "$@") + ARGS=$(getopt -o "" -l "model:,quant:,recipe:,kv_cache_quant:,tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,simple_eval_limit:,mmlu_limit:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:,calib_seq:,auto_quantize_method:,auto_quantize_score_size:,auto_quantize_checkpoint:,moe_calib_experts_ratio:,cast_mxfp4_to_nvfp4,vlm,calib_with_images" -n "$0" -- "$@") eval set -- "$ARGS" while true; do @@ -76,6 +78,8 @@ parse_options() { --auto_quantize_checkpoint ) AUTO_QUANTIZE_CHECKPOINT="$2"; shift 2;; --moe_calib_experts_ratio ) MOE_CALIB_EXPERTS_RATIO="$2"; shift 2;; --cast_mxfp4_to_nvfp4 ) CAST_MXFP4_TO_NVFP4=true; shift;; + --vlm ) VLM=true; shift;; + --calib_with_images ) CALIB_WITH_IMAGES=true; shift;; -- ) shift; break ;; * ) break ;; esac @@ -176,5 +180,7 @@ parse_options() { echo 
"auto_quantize_checkpoint: $AUTO_QUANTIZE_CHECKPOINT" echo "moe_calib_experts_ratio: $MOE_CALIB_EXPERTS_RATIO" echo "cast_mxfp4_to_nvfp4: $CAST_MXFP4_TO_NVFP4" + echo "vlm: $VLM" + echo "calib_with_images: $CALIB_WITH_IMAGES" echo "=================" } diff --git a/examples/vlm_ptq/README.md b/examples/vlm_ptq/README.md index 8b9c31aa429..d36a82345e2 100644 --- a/examples/vlm_ptq/README.md +++ b/examples/vlm_ptq/README.md @@ -1,80 +1,32 @@ -# Post-training quantization (PTQ) for Vision Language Models +# [Deprecated] Post-training quantization (PTQ) for Vision Language Models -To learn more about the quantization feature, please refer to the [documentation](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html). +> **This example has been consolidated into [`examples/llm_ptq`](../llm_ptq/README.md) and is +> deprecated.** It will be removed in a future release. VLM PTQ now shares the same entry point +> (`hf_ptq.py`) and shell script as LLM PTQ. -Quantization is an effective model optimization technique that compresses your models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. \ -Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4 and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and Double Quantization with easy-to-use Python APIs. +## Migration -This section focuses on Post-training quantization for VLM (Vision Language Models), a technique that reduces model precision after training to improve inference efficiency without requiring retraining. - -
- -| **Section** | **Description** | **Link** | **Docs** | -| :------------: | :------------: | :------------: | :------------: | -| Pre-Requisites | Required & optional packages to use this technique | \[[Link](#pre-requisites)\] | | -| Getting Started | Learn how to optimize your models using PTQ to reduce precision and improve inference efficiency | \[[Link](#getting-started)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] | -| Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models | \[[Link](#support-matrix)\] | | -| Framework Scripts | Example scripts demonstrating quantization techniques for optimizing Hugging Face / Megatron-Bridge / Megatron-LM models | \[[Link](#framework-scripts)\] | | -| Pre-Quantized Checkpoints | Ready to deploy Hugging Face pre-quantized checkpoints | \[[Link](#pre-quantized-checkpoints)\] | | -| Resources | Extra links to relevant resources | \[[Link](#resources)\] | | - -
- -## Pre-Requisites - -Please refer to the [llm_ptq/README.md](../llm_ptq/README.md#pre-requisites) for the pre-requisites. - -## Getting Started - -Please refer to the [llm_ptq/README.md](../llm_ptq/README.md#getting-started) for the getting-started. - -## Support Matrix - -### Supported Models - -| Model | fp8 | int8_sq<sup>1</sup> | int4_awq | w4a8_awq<sup>2</sup> | nvfp4<sup>3</sup> | -| :---: | :---: | :---: | :---: | :---: | :---: | -| Llava | ✅ | ✅ | ✅ | ✅ | - | -| VILA | ✅ | ✅ | ✅ | ✅ | - | -| Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ | -| Qwen2, 2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ | -| Gemma3 | ✅ | - | - | - | - | - -> *1.Only TensorRT-LLM checkpoint export is supported. Not compatible with the TensorRT-LLM torch backend* \ -> *2.The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \ -> *3.A selective set of the popular models are internally tested. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.* - -> *For detailed TensorRT-LLM torch backend multimodal support, please refer to [this doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md#multimodal-feature-support-matrix-pytorch-backend)* - -> *The accuracy loss after PTQ may vary depending on the actual model and the quantization method. Different models may have different accuracy loss and usually the accuracy loss is more significant when the base model is small. If the accuracy after PTQ is not meeting the requirement, please try either modifying [hf_ptq.py](../llm_ptq/hf_ptq.py) and disabling the KV cache quantization or using the [QAT](./../llm_qat/README.md) instead.* - -## Framework Scripts - -Please refer to the [llm_ptq/README.md](../llm_ptq/README.md) about the details of model quantization. - -The following scripts provide an all-in-one and step-by-step model quantization example for the supported Hugging Face multi-modal models. 
The quantization format and the number of GPUs will be supplied as inputs to these scripts. - -### Hugging Face Example [Script](./scripts/huggingface_example.sh) +Use the `llm_ptq` script with the `--vlm` flag: ```bash -scripts/huggingface_example.sh --model --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq] +cd examples/llm_ptq +scripts/huggingface_example.sh --model --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq] --vlm ``` -### Megatron-Bridge Example - -Please refer to the [examples/megatron_bridge/](../megatron_bridge/README.md) for example scripts for PTQ with Megatron-Bridge. +The previous `examples/vlm_ptq/scripts/huggingface_example.sh` entry point still works: it now +prints a deprecation warning and forwards to the command above. -## Pre-Quantized Checkpoints +## Where things moved -- Ready-to-deploy checkpoints \[[🤗 Hugging Face - Nvidia Model Optimizer Collection](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer)\] -- Deployable on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) -- More models coming soon! 
+| Topic | New location | +| :--- | :--- | +| Supported VLMs / support matrix | [llm_ptq/README.md#vision-language-model-vlm-supported-models](../llm_ptq/README.md#vision-language-model-vlm-supported-models) | +| VLM quantization workflow (`--vlm`) | [llm_ptq/README.md#vlm-quantization](../llm_ptq/README.md#vlm-quantization) | +| Image-text calibration (`--calib_with_images`) | [llm_ptq/README.md#vlm-calibration-with-image-text-pairs-eg-nemotron-vl](../llm_ptq/README.md#vlm-calibration-with-image-text-pairs-eg-nemotron-vl) | +| VILA dependencies | [llm_ptq/requirements-vila.txt](../llm_ptq/requirements-vila.txt) | +| Megatron-Bridge VLM PTQ | [examples/megatron_bridge/](../megatron_bridge/README.md) | ## Resources -- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146) - 📖 [Documentation](https://nvidia.github.io/Model-Optimizer) -- 🎯 [Benchmarks](../benchmark.md) - 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html) -- 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md) -- ✨ [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md) diff --git a/examples/vlm_ptq/scripts/huggingface_example.sh b/examples/vlm_ptq/scripts/huggingface_example.sh index eada1c137f8..2e42a3c768d 100755 --- a/examples/vlm_ptq/scripts/huggingface_example.sh +++ b/examples/vlm_ptq/scripts/huggingface_example.sh @@ -14,130 +14,21 @@ # See the License for the specific language governing permissions and # limitations under the License. -set -e - -script_dir="$(dirname "$(readlink -f "$0")")" - -source $script_dir/../../llm_ptq/scripts/parser.sh -parse_options "$@" - -set -x - -# This will prevent the script from hanging on Selene/EOS due to the MPI support. 
-echo "********** unset all SLURM_, PMI_, PMIX_ Variables **********" -for i in $(env | grep ^SLURM_ | cut -d"=" -f 1); do unset -v $i; done -for i in $(env | grep ^PMI_ | cut -d"=" -f 1); do unset -v $i; done -for i in $(env | grep ^PMIX_ | cut -d"=" -f 1); do unset -v $i; done +# DEPRECATED: examples/vlm_ptq has been consolidated into examples/llm_ptq. +# This shim forwards all arguments to the llm_ptq script with the --vlm flag so existing +# commands keep working. Please migrate to: +# +# cd examples/llm_ptq +# scripts/huggingface_example.sh --model --quant --vlm +# +# See examples/llm_ptq/README.md#vlm-quantization for details. -if [ -z "$MODEL_PATH" ]; then - echo "Unsupported model argument: Expected a huggingface model path or model name" >&2 - exit 1 -fi +set -e -case $QFORMAT in - fp8|int8_sq|int4_awq|w4a8_awq|nvfp4) - ;; - *) - echo "Unknown quant argument: Expected one of: [fp8, int8_sq, int4_awq, w4a8_awq, nvfp4]" >&2 - exit 1 -esac +echo "WARNING: examples/vlm_ptq is deprecated and will be removed in a future release." >&2 +echo " Forwarding to examples/llm_ptq/scripts/huggingface_example.sh --vlm" >&2 +echo " See examples/llm_ptq/README.md#vlm-quantization" >&2 script_dir="$(dirname "$(readlink -f "$0")")" -pushd $script_dir/.. 
- -if [ -z "$ROOT_SAVE_PATH" ]; then - ROOT_SAVE_PATH=$(pwd) -fi - -MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_${QFORMAT}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}} -SAVE_PATH=${ROOT_SAVE_PATH}/saved_models_${MODEL_NAME} - -MODEL_CONFIG=${SAVE_PATH}/config.json - -if [ "${REMOVE_EXISTING_MODEL_CONFIG,,}" = "true" ]; then - rm -f $MODEL_CONFIG -fi - -PTQ_ARGS="" - -if [ -n "$AUTO_QUANTIZE_BITS" ]; then - PTQ_ARGS+=" --auto_quantize_bits $AUTO_QUANTIZE_BITS " -fi - -if $TRUST_REMOTE_CODE; then - PTQ_ARGS+=" --trust_remote_code " -fi - -if [ -n "$KV_CACHE_QUANT" ]; then - PTQ_ARGS+=" --kv_cache_qformat=$KV_CACHE_QUANT " -fi - -if [[ "${MODEL_NAME,,}" == *"vila"* ]]; then - # Check transformers version - must be <= 4.50.0 - CURRENT_TRANSFORMERS_VERSION=$(pip show transformers | grep Version | cut -d' ' -f2) - if [ "$(printf '%s\n' "4.50.0" "$CURRENT_TRANSFORMERS_VERSION" | sort -V | head -n1)" = "4.50.0" ] && [ "$CURRENT_TRANSFORMERS_VERSION" != "4.50.0" ]; then - echo "ERROR: transformers version $CURRENT_TRANSFORMERS_VERSION is not supported." >&2 - echo "VILA requires transformers<=4.50.0" >&2 - echo "Please refer to examples/vlm_ptq/requirements-vila.txt for the supported versions." >&2 - echo "You also need to download VILA repository from https://github.com/Efficient-Large-Model/VILA.git and checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5" >&2 - exit 1 - fi - - pip install -r ../vlm_ptq/requirements-vila.txt - # Clone original VILA repo - if [ ! -d "$(dirname "$MODEL_PATH")/VILA" ]; then - echo "VILA repository is needed until it is added to HF model zoo. Cloning the repository parallel to $MODEL_PATH..." - git clone https://github.com/Efficient-Large-Model/VILA.git "$(dirname "$MODEL_PATH")/VILA" && \ - cd "$(dirname "$MODEL_PATH")/VILA" && \ - git checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5 && \ - cd "$script_dir/.." - fi -fi - -if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then - if ! 
[ -f $MODEL_CONFIG ]; then - echo "Quantizing original model..." - python ../llm_ptq/hf_ptq.py \ - --pyt_ckpt_path=$MODEL_PATH \ - --export_path=$SAVE_PATH \ - --qformat=$QFORMAT \ - --calib_size=$CALIB_SIZE \ - --batch_size=$CALIB_BATCH_SIZE \ - --inference_tensor_parallel=$TP \ - --inference_pipeline_parallel=$PP \ - $PTQ_ARGS - else - echo "Quantized model config $MODEL_CONFIG exists, skipping the quantization stage" - fi -fi - -if [[ "$QFORMAT" != "fp8" ]]; then - echo "For quant format $QFORMAT, please refer to the TensorRT-LLM documentation for deployment. Checkpoint saved to $SAVE_PATH." - exit 0 -fi - -if [[ "$QFORMAT" == *"nvfp4"* ]] || [[ "$KV_CACHE_QUANT" == *"nvfp4"* ]]; then - cuda_major=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0 | cut -d. -f1) - - if [ "$cuda_major" -lt 10 ]; then - echo "Please deploy the NVFP4 checkpoint on a Blackwell GPU. Checkpoint export_path: $SAVE_PATH" - exit 0 - fi -fi - -# Prepare datasets for TRT-LLM benchmark -if [ -z "$TRT_LLM_CODE_PATH" ]; then - TRT_LLM_CODE_PATH=/app/tensorrt_llm # default path for the TRT-LLM release docker image - echo "Setting default TRT_LLM_CODE_PATH to $TRT_LLM_CODE_PATH." -fi - -QUICK_START_MULTIMODAL=$TRT_LLM_CODE_PATH/examples/llm-api/quickstart_multimodal.py - -if [ -f "$QUICK_START_MULTIMODAL" ]; then - python3 $QUICK_START_MULTIMODAL --model_dir $SAVE_PATH --modality image -else - echo "Warning: $QUICK_START_MULTIMODAL cannot be found. Please set TRT_LLM_CODE_PATH to the TRT-LLM code path or test the quantized checkpoint $SAVE_PATH with the TRT-LLM repo directly." -fi - -popd +exec "$script_dir/../../llm_ptq/scripts/huggingface_example.sh" --vlm "$@"
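
The VILA branch that this change moves into `examples/llm_ptq/scripts/huggingface_example.sh` gates on `transformers<=4.50.0` by piping two version strings through `sort -V` and checking which one comes first. The comparison is easy to misread inline, so here is a minimal standalone sketch of the same idea. It assumes GNU `sort` (for `-V`); the helper names `version_gt` and `check_transformers_ceiling` are hypothetical and do not appear in the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): succeeds when $1 is strictly newer than $2.
version_gt() {
    local candidate="$1" ceiling="$2"
    # sort -V orders version strings numerically per component. If the two differ
    # and the ceiling sorts first, the candidate is strictly newer than the ceiling.
    [ "$candidate" != "$ceiling" ] &&
        [ "$(printf '%s\n' "$ceiling" "$candidate" | sort -V | head -n1)" = "$ceiling" ]
}

# Hypothetical wrapper mirroring the VILA gate: anything above the ceiling is rejected.
check_transformers_ceiling() {
    local current="$1" ceiling="${2:-4.50.0}"
    if version_gt "$current" "$ceiling"; then
        echo "unsupported"
    else
        echo "ok"
    fi
}

check_transformers_ceiling 4.49.2   # below the ceiling
check_transformers_ceiling 4.50.0   # exactly the ceiling
check_transformers_ceiling 4.51.3   # above the ceiling
```

Wrapping the comparison in a function keeps the `exit 1` path in the real script readable, and makes the boundary case (exactly `4.50.0`) explicit. The sketch only covers plain `X.Y.Z` versions; how pre-release or `.dev` suffixes should be ordered is left open here.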