NVIDIA · Edwardf0t1 · Jun 12, 2026 · kevalmorabia97 · Jun 13, 2026 · cjluo-nv
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -4,6 +4,10 @@ Changelog
 0.46 (2026-xx-xx)
 ^^^^^^^^^^^^^^^^^
 
+**Deprecations**
+
+- Consolidated ``examples/vlm_ptq`` into ``examples/llm_ptq``. Vision-language model PTQ now shares the ``hf_ptq.py`` entry point and ``scripts/huggingface_example.sh``; pass ``--vlm`` to bootstrap VLM-specific dependencies (e.g. VILA via ``examples/llm_ptq/requirements-vila.txt``) and run the TensorRT-LLM multimodal quickstart smoke test. The ``examples/vlm_ptq/scripts/huggingface_example.sh`` entry point is deprecated: it now prints a warning and forwards to the ``llm_ptq`` script with ``--vlm``, and will be removed in a future release. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#vlm-quantization>`__.
+
 **New Features**
 
 - Add the ``day0-release`` agent skill (``.agents/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and baseline-vs-candidate accuracy threshold. v1 reports and stops on regression; the recipe-search loop is deferred.

@@ -102,7 +102,7 @@ more fine-grained control on installed dependencies or for alternative docker im
 
 | **Technique** | **Description** | **Examples** | **Docs** |
 | :------------: | :------------: | :------------: | :------------: |
-| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/vlm_ptq/)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
+| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/llm_ptq/README.md#vlm-quantization)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
 | Quantization Aware Training | Refine accuracy even further with a few training steps! | \[[Hugging Face](./examples/llm_qat/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
 | Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | \[[General](./examples/pruning/)\] \[[Megatron-Bridge](./examples/megatron_bridge/README.md#pruning)\] | |
 | Distillation | Reduce deployment model size by teaching small models to behave like larger models! | \[[Megatron-Bridge](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-bridge-framework)\] \[[Megatron-LM](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-lm-framework)\] \[[Hugging Face](./examples/llm_distill/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
@@ -132,7 +132,7 @@ more fine-grained control on installed dependencies or for alternative docker im
 |------------|----------------|
 | LLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#support-matrix) |
 | Diffusers Quantization | [View Support Matrix](./examples/diffusers/README.md#support-matrix) |
-| VLM Quantization | [View Support Matrix](./examples/vlm_ptq/README.md#support-matrix) |
+| VLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#vision-language-model-vlm-supported-models) |
 | ONNX Quantization | [View Support Matrix](./examples/torch_onnx/README.md#onnx-export-supported-llm-models) |
 | Windows Quantization | [View Support Matrix](./examples/windows/README.md#support-matrix) |
 | Quantization Aware Training | [View Support Matrix](./examples/llm_qat/README.md#support-matrix) |

@@ -136,6 +136,29 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http
 
 > You can also create your own custom config using [this](https://nvidia.github.io/Model-Optimizer/guides/_pytorch_quantization.html#custom-calibration-algorithm) guide.
 
+### Vision Language Model (VLM) Supported Models
+
+PTQ for vision-language models is handled by the same `hf_ptq.py` entry point and shell script as
+LLMs — the language model is quantized while the vision encoder is kept in high precision. Pass
+`--vlm` to the shell script (see [VLM quantization](#vlm-quantization)).
+
+| Model | fp8 | int8_sq<sup>1</sup> | int4_awq | w4a8_awq<sup>2</sup> | nvfp4<sup>3</sup> |
+| :---: | :---: | :---: | :---: | :---: | :---: |
+| Llava | ✅ | ✅ | ✅ | ✅ | - |
+| VILA<sup>4</sup> | ✅ | ✅ | ✅ | ✅ | - |
+| Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Qwen2-VL, Qwen2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Gemma3 | ✅ | - | - | - | - |
+| Nemotron VL<sup>5</sup> | ✅ | - | - | - | ✅ |
+
+> *<sup>1.</sup>Only TensorRT-LLM checkpoint export is supported. Not compatible with the TensorRT-LLM torch backend.* \
+> *<sup>2.</sup>w4a8_awq is an experimental quantization scheme that may incur a larger accuracy penalty.* \
+> *<sup>3.</sup>A select set of popular models is tested internally; the actual list of supported models may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.* \
+> *<sup>4.</sup>VILA requires `transformers<=4.50.0` and the original VILA repo; the shell script bootstraps both (see [`requirements-vila.txt`](./requirements-vila.txt)).* \
+> *<sup>5.</sup>Nemotron VL automatically calibrates with image-text pairs; see [VLM calibration with image-text pairs](#vlm-calibration-with-image-text-pairs-eg-nemotron-vl).*
+
+> *For detailed TensorRT-LLM torch backend multimodal support, please refer to [this doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md#multimodal-feature-support-matrix-pytorch-backend).*
+
 ## Framework Scripts
 
 ### Hugging Face Example [Script](./scripts/huggingface_example.sh)
@@ -243,6 +266,23 @@ The cast pins each NVFP4 block's `scale_2 = 2^(k_max - 8)` and `_amax = 6 * 2^k_
 
 [PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.
 
+#### VLM quantization
+
+Vision-language models are quantized through the same script. Add `--vlm` so the script bootstraps
+any VLM-specific dependencies (e.g. VILA) and runs the TensorRT-LLM multimodal quickstart as the
+deploy smoke test instead of the text-only one:
+
+```bash
+scripts/huggingface_example.sh --model <Hugging Face model card or checkpoint> --quant fp8 --vlm
+```
+
+Supported `--quant` values for VLMs are `fp8`, `nvfp4`, `int8_sq`, `int4_awq`, and `w4a8_awq` (see
+[VLM Supported Models](#vision-language-model-vlm-supported-models)). For VILA models the script
+additionally installs [`requirements-vila.txt`](./requirements-vila.txt) and clones the VILA repo
+next to the checkpoint.
+
+> *This consolidates the former `examples/vlm_ptq` example, which now forwards here.*
+
 #### VLM calibration with image-text pairs (e.g., Nemotron VL)
 
 For vision-language models, calibration quality can likely improve by using image-text pairs instead of text-only data, especially on visual understanding tasks:
@@ -257,6 +297,12 @@ python hf_ptq.py \
   --calib_size 512
 ```
 
+The same flag is exposed by the shell script:
+
+```bash
+scripts/huggingface_example.sh --model <model> --quant nvfp4 --vlm --calib_with_images --trust_remote_code
+```
+
 > Note: when `--calib_with_images` is set, `--calib_size` must be a single value, and the calibration dataset is nvidia/nemotron_vlm_dataset_v2.
 This functionality is currently in beta and has been tested on `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16`.
 

@@ -0,0 +1,9 @@
+# Extra dependencies for quantizing NVILA / VILA vision-language models.
+#
+# VILA is not yet published on the Hugging Face model zoo, so its modeling code must be
+# cloned separately (handled automatically by scripts/huggingface_example.sh --vlm) from
+# https://github.com/Efficient-Large-Model/VILA.git at commit
+# ec7fb2c264920bf004fd9fa37f1ec36ea0942db5.
+#
+# VILA's modeling code is only compatible with transformers<=4.50.0.
+transformers<=4.50.0
@@ -84,8 +84,36 @@ if [ "${REMOVE_EXISTING_MODEL_CONFIG,,}" = "true" ]; then
     rm -f $MODEL_CONFIG
 fi
 
+# VILA vision-language models are not yet on the HF model zoo and require the original
+# VILA repo plus an older transformers. Only triggered for VLM runs (--vlm) on VILA models.
+if [ "$VLM" = "true" ] && [[ "${MODEL_NAME,,}" == *"vila"* ]]; then
+    # Check transformers version - must be <= 4.50.0
+    CURRENT_TRANSFORMERS_VERSION=$(pip show transformers | grep '^Version:' | cut -d' ' -f2)
+    if [ "$(printf '%s\n' "4.50.0" "$CURRENT_TRANSFORMERS_VERSION" | sort -V | head -n1)" = "4.50.0" ] && [ "$CURRENT_TRANSFORMERS_VERSION" != "4.50.0" ]; then
+        echo "ERROR: transformers version $CURRENT_TRANSFORMERS_VERSION is not supported." >&2
+        echo "VILA requires transformers<=4.50.0" >&2
+        echo "Please refer to examples/llm_ptq/requirements-vila.txt for the supported versions." >&2
+        echo "You also need to clone the VILA repository from https://github.com/Efficient-Large-Model/VILA.git and check out commit ec7fb2c264920bf004fd9fa37f1ec36ea0942db5." >&2
+        exit 1
+    fi
+
+    pip install -r requirements-vila.txt
+    # Clone original VILA repo
+    if [ ! -d "$(dirname "$MODEL_PATH")/VILA" ]; then
+        echo "The VILA repository is needed until VILA is added to the HF model zoo. Cloning it next to $MODEL_PATH..."
+        git clone https://github.com/Efficient-Large-Model/VILA.git "$(dirname "$MODEL_PATH")/VILA" && \
+            cd "$(dirname "$MODEL_PATH")/VILA" && \
+            git checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5 && \
+            cd "$script_dir/.."
+    fi
+fi
+
 PTQ_ARGS=""
 
+if [ "$CALIB_WITH_IMAGES" = "true" ]; then
+    PTQ_ARGS+=" --calib_with_images "
+fi
+
 if [ "$LOW_MEMORY_MODE" = "true" ]; then
     PTQ_ARGS+=" --low_memory_mode "
 fi
@@ -223,7 +251,21 @@ if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH
     # Only run the deploy+generate smoke test when "quant" is explicitly requested. Eval tasks
     # (lm_eval/mmlu/simple_eval) deploy the checkpoint themselves, so it is redundant there.
     if [[ $TASKS =~ "quant" ]]; then
-        python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS
+        if [ "$VLM" = "true" ]; then
+            # VLMs use the TRT-LLM multimodal quickstart for the deploy smoke test.
+            if [ -z "$TRT_LLM_CODE_PATH" ]; then
+                TRT_LLM_CODE_PATH=/app/tensorrt_llm # default path for the TRT-LLM release docker image
+                echo "Setting default TRT_LLM_CODE_PATH to $TRT_LLM_CODE_PATH."
+            fi
+            QUICK_START_MULTIMODAL=$TRT_LLM_CODE_PATH/examples/llm-api/quickstart_multimodal.py
+            if [ -f "$QUICK_START_MULTIMODAL" ]; then
+                python3 $QUICK_START_MULTIMODAL --model_dir $SAVE_PATH --modality image
+            else
+                echo "Warning: $QUICK_START_MULTIMODAL cannot be found. Please set TRT_LLM_CODE_PATH to the TRT-LLM code path or test the quantized checkpoint $SAVE_PATH with the TRT-LLM repo directly."
+            fi
+        else
+            python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS
+        fi
     fi
 fi
 

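Review note: the transformers version gate in the VILA block above relies on `sort -V` to compare version strings. As a minimal sketch of that comparison in isolation, the helper below returns success only when its first argument is strictly newer than its second. The name `version_gt` is hypothetical and not part of this PR, and the sketch assumes a `sort` that supports `-V` (GNU coreutils).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): succeeds when $1 is strictly newer than $2.
# Assumes GNU sort, which provides -V (version sort).
version_gt() {
    local candidate="$1" limit="$2"
    # If the two strings differ and the limit sorts first in version order,
    # the candidate must be the newer of the two.
    [ "$candidate" != "$limit" ] && \
        [ "$(printf '%s\n' "$limit" "$candidate" | sort -V | head -n1)" = "$limit" ]
}
```

With such a helper, the gate could be written as a single condition, for example `if version_gt "$CURRENT_TRANSFORMERS_VERSION" "4.50.0"; then ... fi`.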
@@ -37,9 +37,11 @@ parse_options() {
     VERBOSE=true
     USE_SEQ_DEVICE_MAP=false
     CAST_MXFP4_TO_NVFP4=false
+    VLM=false
+    CALIB_WITH_IMAGES=false
 
   # Parse command-line options
-  ARGS=$(getopt -o "" -l "model:,quant:,recipe:,kv_cache_quant:,tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,simple_eval_limit:,mmlu_limit:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:,calib_seq:,auto_quantize_method:,auto_quantize_score_size:,auto_quantize_checkpoint:,moe_calib_experts_ratio:,cast_mxfp4_to_nvfp4" -n "$0" -- "$@")
+  ARGS=$(getopt -o "" -l "model:,quant:,recipe:,kv_cache_quant:,tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,simple_eval_limit:,mmlu_limit:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:,calib_seq:,auto_quantize_method:,auto_quantize_score_size:,auto_quantize_checkpoint:,moe_calib_experts_ratio:,cast_mxfp4_to_nvfp4,vlm,calib_with_images" -n "$0" -- "$@")
 
   eval set -- "$ARGS"
   while true; do
@@ -76,6 +78,8 @@ parse_options() {
       --auto_quantize_checkpoint ) AUTO_QUANTIZE_CHECKPOINT="$2"; shift 2;;
       --moe_calib_experts_ratio ) MOE_CALIB_EXPERTS_RATIO="$2"; shift 2;;
       --cast_mxfp4_to_nvfp4 ) CAST_MXFP4_TO_NVFP4=true; shift;;
+      --vlm ) VLM=true; shift;;
+      --calib_with_images ) CALIB_WITH_IMAGES=true; shift;;
       -- ) shift; break ;;
       * ) break ;;
     esac
@@ -176,5 +180,7 @@ parse_options() {
   echo "auto_quantize_checkpoint: $AUTO_QUANTIZE_CHECKPOINT"
   echo "moe_calib_experts_ratio: $MOE_CALIB_EXPERTS_RATIO"
   echo "cast_mxfp4_to_nvfp4: $CAST_MXFP4_TO_NVFP4"
+  echo "vlm: $VLM"
+  echo "calib_with_images: $CALIB_WITH_IMAGES"
   echo "================="
 }
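
Review note: the two new options are value-less long flags, so they appear in the `getopt` spec without a trailing colon and consume a single `shift`. The trimmed-down parser below is a sketch of that pattern only. The function name `parse_flags` is hypothetical, only `--model`, `--vlm`, and `--calib_with_images` are handled, and it assumes the enhanced `getopt` from util-linux.

```bash
#!/usr/bin/env bash
# Hypothetical, reduced parser illustrating the boolean-flag pattern; not the PR's parser.
# Assumes the enhanced getopt from util-linux (long options via -l).
parse_flags() {
    # Defaults are set before parsing so later string tests always see true or false.
    VLM=false
    CALIB_WITH_IMAGES=false
    MODEL=""

    local args
    # "model:" takes a value; "vlm" and "calib_with_images" do not (no trailing colon).
    args=$(getopt -o "" -l "model:,vlm,calib_with_images" -n "parse_flags" -- "$@") || return 1

    eval set -- "$args"
    while true; do
        case "$1" in
            --model ) MODEL="$2"; shift 2;;
            --vlm ) VLM=true; shift;;
            --calib_with_images ) CALIB_WITH_IMAGES=true; shift;;
            -- ) shift; break;;
            * ) break;;
        esac
    done
}
```

Calling `parse_flags --model my-model --vlm` leaves `VLM` set to `true` and `CALIB_WITH_IMAGES` at its default of `false`, which is the state the example script then tests before adding VLM-specific arguments.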