4 changes: 4 additions & 0 deletions CHANGELOG.rst
@@ -4,6 +4,10 @@ Changelog
0.46 (2026-xx-xx)
^^^^^^^^^^^^^^^^^

**Deprecations**

- Consolidated ``examples/vlm_ptq`` into ``examples/llm_ptq``. Vision-language model PTQ now shares the ``hf_ptq.py`` entry point and ``scripts/huggingface_example.sh``; pass ``--vlm`` to bootstrap VLM-specific dependencies (e.g. VILA via ``examples/llm_ptq/requirements-vila.txt``) and run the TensorRT-LLM multimodal quickstart smoke test. The ``examples/vlm_ptq/scripts/huggingface_example.sh`` entry point is deprecated: it now prints a warning and forwards to the ``llm_ptq`` script with ``--vlm``, and will be removed in a future release. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#vlm-quantization>`__.

**New Features**

- Add the ``day0-release`` agent skill (``.agents/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and baseline-vs-candidate accuracy threshold. v1 reports and stops on regression; the recipe-search loop is deferred.
4 changes: 2 additions & 2 deletions README.md
@@ -102,7 +102,7 @@ more fine-grained control on installed dependencies or for alternative docker im

| **Technique** | **Description** | **Examples** | **Docs** |
| :------------: | :------------: | :------------: | :------------: |
-| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/vlm_ptq/)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
+| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/llm_ptq/README.md#vlm-quantization)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Quantization Aware Training | Refine accuracy even further with a few training steps! | \[[Hugging Face](./examples/llm_qat/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | \[[General](./examples/pruning/)\] \[[Megatron-Bridge](./examples/megatron_bridge/README.md#pruning)\] | |
| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | \[[Megatron-Bridge](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-bridge-framework)\] \[[Megatron-LM](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-lm-framework)\] \[[Hugging Face](./examples/llm_distill/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
@@ -132,7 +132,7 @@ more fine-grained control on installed dependencies or for alternative docker im
|------------|----------------|
| LLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#support-matrix) |
| Diffusers Quantization | [View Support Matrix](./examples/diffusers/README.md#support-matrix) |
-| VLM Quantization | [View Support Matrix](./examples/vlm_ptq/README.md#support-matrix) |
+| VLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#vision-language-model-vlm-supported-models) |
| ONNX Quantization | [View Support Matrix](./examples/torch_onnx/README.md#onnx-export-supported-llm-models) |
| Windows Quantization | [View Support Matrix](./examples/windows/README.md#support-matrix) |
| Quantization Aware Training | [View Support Matrix](./examples/llm_qat/README.md#support-matrix) |
46 changes: 46 additions & 0 deletions examples/llm_ptq/README.md

Collaborator

Do we also want to rename examples/llm_ptq to examples/hf_ptq?

Collaborator

This is a good idea, though I'm not sure whether it would be a breaking change.

Collaborator

Can we leave a symlink from examples/llm_ptq/ to the new examples/hf_ptq/ directory so the previous path remains valid, and then remove the symlink after a few releases?
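
The symlink idea above could look roughly like the following sketch. It runs in a scratch directory created with `mktemp -d`; `examples/hf_ptq` is only the name proposed in this thread, not an existing path, and the script contents are placeholders.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so no real checkout is touched.
repo=$(mktemp -d)
mkdir -p "$repo/examples/llm_ptq/scripts"
echo 'echo "ptq entry point"' > "$repo/examples/llm_ptq/scripts/huggingface_example.sh"

# Rename the directory (hypothetical new name from this thread) ...
mv "$repo/examples/llm_ptq" "$repo/examples/hf_ptq"
# ... and leave a relative symlink behind so the old path keeps resolving.
ln -s hf_ptq "$repo/examples/llm_ptq"

# The old path now reaches the same script through the symlink.
old_path_output=$(bash "$repo/examples/llm_ptq/scripts/huggingface_example.sh")
echo "$old_path_output"
```

In a real checkout the rename would presumably be done with `git mv` and the symlink committed. Git stores symlinks natively, but they may not be checked out as links on Windows without extra configuration, which could matter for the Windows examples.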

@@ -136,6 +136,29 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http

> You can also create your own custom config using [this](https://nvidia.github.io/Model-Optimizer/guides/_pytorch_quantization.html#custom-calibration-algorithm) guide.

### Vision Language Model (VLM) Supported Models

PTQ for vision-language models is handled by the same `hf_ptq.py` entry point and shell script as
LLMs — the language model is quantized while the vision encoder is kept in high precision. Pass
`--vlm` to the shell script (see [VLM quantization](#vlm-quantization)).

| Model | fp8 | int8_sq<sup>1</sup> | int4_awq | w4a8_awq<sup>2</sup> | nvfp4<sup>3</sup> |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Llava | ✅ | ✅ | ✅ | ✅ | - |
| VILA<sup>4</sup> | ✅ | ✅ | ✅ | ✅ | - |
| Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen2, 2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ |
| Gemma3 | ✅ | - | - | - | - |
| Nemotron VL<sup>5</sup> | ✅ | - | - | - | ✅ |

> *<sup>1.</sup>Only TensorRT-LLM checkpoint export is supported. Not compatible with the TensorRT-LLM torch backend.* \
> *<sup>2.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
> *<sup>3.</sup>A select set of popular models is tested internally. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.* \
> *<sup>4.</sup>VILA requires `transformers<=4.50.0` and the original VILA repo; the shell script bootstraps both (see [`requirements-vila.txt`](./requirements-vila.txt)).* \
> *<sup>5.</sup>Nemotron VL automatically calibrates with image-text pairs; see [VLM calibration with image-text pairs](#vlm-calibration-with-image-text-pairs-eg-nemotron-vl).*

> *For detailed TensorRT-LLM torch backend multimodal support, please refer to [this doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md#multimodal-feature-support-matrix-pytorch-backend).*

## Framework Scripts

### Hugging Face Example [Script](./scripts/huggingface_example.sh)
@@ -243,6 +266,23 @@ The cast pins each NVFP4 block's `scale_2 = 2^(k_max - 8)` and `_amax = 6 * 2^k_

[PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.

#### VLM quantization

Vision-language models are quantized through the same script. Add `--vlm` so the script bootstraps
any VLM-specific dependencies (e.g. VILA) and runs the TensorRT-LLM multimodal quickstart as the
deploy smoke test instead of the text-only one:

```bash
scripts/huggingface_example.sh --model <Hugging Face model card or checkpoint> --quant fp8 --vlm
```

Supported `--quant` values for VLMs are `fp8`, `nvfp4`, `int8_sq`, `int4_awq`, and `w4a8_awq` (see
[VLM Supported Models](#vision-language-model-vlm-supported-models)). For VILA models the script
additionally installs [`requirements-vila.txt`](./requirements-vila.txt) and clones the VILA repo
next to the checkpoint.

> *This consolidates the former `examples/vlm_ptq` example, which now forwards here.*

#### VLM calibration with image-text pairs (e.g., Nemotron VL)

For vision-language models, calibration quality is likely to improve when image-text pairs are used instead of text-only data, especially on visual understanding tasks:
@@ -257,6 +297,12 @@ python hf_ptq.py \
--calib_size 512
```

The same flag is exposed by the shell script:

```bash
scripts/huggingface_example.sh --model <model> --quant nvfp4 --vlm --calib_with_images --trust_remote_code
```

> Note: when `--calib_with_images` is set, `--calib_size` must be a single value, and the calibration dataset is nvidia/nemotron_vlm_dataset_v2.
This functionality is currently in beta and has been tested on `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16`.

9 changes: 9 additions & 0 deletions examples/llm_ptq/requirements-vila.txt

Collaborator

Do we want to drop VILA model support? ModelOpt's minimum transformers version is 4.56, so we can no longer guarantee that it works with 4.50.

Collaborator

+1
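
The conflict raised in this thread can be checked with the same `sort -V` comparison that the script's VILA guard relies on. A minimal sketch, assuming GNU `sort`; `version_gt` is a hypothetical helper, and the two version numbers are taken from the comment above and from `requirements-vila.txt`:

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeed when version $1 is strictly newer than $2.
# sort -V orders version strings; if $2 sorts first and the two strings
# differ, then $1 is the newer one.
version_gt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

VILA_MAX="4.50.0"    # upper bound from requirements-vila.txt
MODELOPT_MIN="4.56"  # minimum cited in the review comment

# If the library's minimum is newer than VILA's maximum, no single
# transformers version can satisfy both constraints.
if version_gt "$MODELOPT_MIN" "$VILA_MAX"; then
    verdict="conflict"
else
    verdict="compatible"
fi
echo "$verdict"
```

With these two numbers the ranges do not overlap, which is the concern the reviewers are raising.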

@@ -0,0 +1,9 @@
# Extra dependencies for quantizing NVILA / VILA vision-language models.
#
# VILA is not yet published on the Hugging Face model zoo, so its modeling code must be
# cloned separately (handled automatically by scripts/huggingface_example.sh --vlm) from
# https://github.com/Efficient-Large-Model/VILA.git at commit
# ec7fb2c264920bf004fd9fa37f1ec36ea0942db5.
#
# VILA's modeling code is only compatible with transformers<=4.50.0.
transformers<=4.50.0
44 changes: 43 additions & 1 deletion examples/llm_ptq/scripts/huggingface_example.sh
@@ -84,8 +84,36 @@ if [ "${REMOVE_EXISTING_MODEL_CONFIG,,}" = "true" ]; then
    rm -f $MODEL_CONFIG
fi

# VILA vision-language models are not yet on the HF model zoo and require the original
# VILA repo plus an older transformers. Only triggered for VLM runs (--vlm) on VILA models.
if $VLM && [[ "${MODEL_NAME,,}" == *"vila"* ]]; then
    # Check transformers version - must be <= 4.50.0
    CURRENT_TRANSFORMERS_VERSION=$(pip show transformers | grep Version | cut -d' ' -f2)
    if [ "$(printf '%s\n' "4.50.0" "$CURRENT_TRANSFORMERS_VERSION" | sort -V | head -n1)" = "4.50.0" ] && [ "$CURRENT_TRANSFORMERS_VERSION" != "4.50.0" ]; then
        echo "ERROR: transformers version $CURRENT_TRANSFORMERS_VERSION is not supported." >&2
        echo "VILA requires transformers<=4.50.0" >&2
        echo "Please refer to examples/llm_ptq/requirements-vila.txt for the supported versions." >&2
        echo "You also need to download VILA repository from https://github.com/Efficient-Large-Model/VILA.git and checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5" >&2
        exit 1
    fi

    pip install -r requirements-vila.txt
    # Clone original VILA repo
    if [ ! -d "$(dirname "$MODEL_PATH")/VILA" ]; then
        echo "VILA repository is needed until it is added to HF model zoo. Cloning the repository parallel to $MODEL_PATH..."
        git clone https://github.com/Efficient-Large-Model/VILA.git "$(dirname "$MODEL_PATH")/VILA" && \
            cd "$(dirname "$MODEL_PATH")/VILA" && \
            git checkout ec7fb2c264920bf004fd9fa37f1ec36ea0942db5 && \
            cd "$script_dir/.."
    fi
fi

PTQ_ARGS=""

if $CALIB_WITH_IMAGES; then
    PTQ_ARGS+=" --calib_with_images "
fi

if [ "$LOW_MEMORY_MODE" = "true" ]; then
    PTQ_ARGS+=" --low_memory_mode "
fi
@@ -223,7 +251,21 @@ if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH
    # Only run the deploy+generate smoke test when "quant" is explicitly requested. Eval tasks
    # (lm_eval/mmlu/simple_eval) deploy the checkpoint themselves, so it is redundant there.
    if [[ $TASKS =~ "quant" ]]; then
-        python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS
        if $VLM; then
            # VLMs use the TRT-LLM multimodal quickstart for the deploy smoke test.
            if [ -z "$TRT_LLM_CODE_PATH" ]; then
                TRT_LLM_CODE_PATH=/app/tensorrt_llm # default path for the TRT-LLM release docker image
                echo "Setting default TRT_LLM_CODE_PATH to $TRT_LLM_CODE_PATH."
            fi
            QUICK_START_MULTIMODAL=$TRT_LLM_CODE_PATH/examples/llm-api/quickstart_multimodal.py
            if [ -f "$QUICK_START_MULTIMODAL" ]; then
                python3 $QUICK_START_MULTIMODAL --model_dir $SAVE_PATH --modality image
            else
Comment on lines +262 to +263

Contributor

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Quote smoke-test paths to avoid argument splitting.

Line 262 and Line 267 expand path variables unquoted, so a save/code path containing spaces can break the Python command arguments.

Proposed fix
-                python3 $QUICK_START_MULTIMODAL --model_dir $SAVE_PATH --modality image
+                python3 "$QUICK_START_MULTIMODAL" --model_dir "$SAVE_PATH" --modality image
...
-            python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS
+            python run_tensorrt_llm.py --checkpoint_dir="$SAVE_PATH" $RUN_ARGS

Also applies to: 267-267

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 262-262: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 262-262: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/scripts/huggingface_example.sh` around lines 262 - 263, The
shell invocation expands QUICK_START_MULTIMODAL and SAVE_PATH unquoted which
breaks when paths contain spaces; update the python3 command invocations (the
lines that call python3 with QUICK_START_MULTIMODAL and the --model_dir
SAVE_PATH flag) to quote those expansions (e.g., wrap QUICK_START_MULTIMODAL and
SAVE_PATH in double quotes) and also quote any other path-like variables used in
the alternate branch at line 267 so arguments aren’t split.

Source: Linters/SAST tools
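
The word-splitting problem flagged here can be reproduced in isolation. The sketch below uses a hypothetical helper, `count_args`, and a made-up path containing a space; neither appears in the PR.

```bash
#!/usr/bin/env bash

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

# Made-up save path with a space in it.
SAVE_PATH="/tmp/my models/ckpt"

# Unquoted: the shell splits the value on the space, so the flag's value
# arrives as two separate arguments.
unquoted=$(count_args --model_dir $SAVE_PATH)

# Quoted: the value stays a single argument.
quoted=$(count_args --model_dir "$SAVE_PATH")

echo "unquoted: $unquoted, quoted: $quoted"  # prints "unquoted: 3, quoted: 2"
```

Note that the proposed fix leaves `$RUN_ARGS` unquoted, presumably because that variable carries several space-separated flags and is meant to split into separate arguments.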

                echo "Warning: $QUICK_START_MULTIMODAL cannot be found. Please set TRT_LLM_CODE_PATH to the TRT-LLM code path or test the quantized checkpoint $SAVE_PATH with the TRT-LLM repo directly."
            fi
        else
            python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS
        fi
    fi
fi

8 changes: 7 additions & 1 deletion examples/llm_ptq/scripts/parser.sh
@@ -37,9 +37,11 @@ parse_options() {
    VERBOSE=true
    USE_SEQ_DEVICE_MAP=false
    CAST_MXFP4_TO_NVFP4=false
    VLM=false
    CALIB_WITH_IMAGES=false

    # Parse command-line options
-    ARGS=$(getopt -o "" -l "model:,quant:,recipe:,kv_cache_quant:,tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,simple_eval_limit:,mmlu_limit:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:,calib_seq:,auto_quantize_method:,auto_quantize_score_size:,auto_quantize_checkpoint:,moe_calib_experts_ratio:,cast_mxfp4_to_nvfp4" -n "$0" -- "$@")
+    ARGS=$(getopt -o "" -l "model:,quant:,recipe:,kv_cache_quant:,tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,simple_eval_limit:,mmlu_limit:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:,calib_seq:,auto_quantize_method:,auto_quantize_score_size:,auto_quantize_checkpoint:,moe_calib_experts_ratio:,cast_mxfp4_to_nvfp4,vlm,calib_with_images" -n "$0" -- "$@")

    eval set -- "$ARGS"
    while true; do
@@ -76,6 +78,8 @@ parse_options() {
            --auto_quantize_checkpoint ) AUTO_QUANTIZE_CHECKPOINT="$2"; shift 2;;
            --moe_calib_experts_ratio ) MOE_CALIB_EXPERTS_RATIO="$2"; shift 2;;
            --cast_mxfp4_to_nvfp4 ) CAST_MXFP4_TO_NVFP4=true; shift;;
            --vlm ) VLM=true; shift;;
            --calib_with_images ) CALIB_WITH_IMAGES=true; shift;;
            -- ) shift; break ;;
            * ) break ;;
        esac
@@ -176,5 +180,7 @@ parse_options() {
    echo "auto_quantize_checkpoint: $AUTO_QUANTIZE_CHECKPOINT"
    echo "moe_calib_experts_ratio: $MOE_CALIB_EXPERTS_RATIO"
    echo "cast_mxfp4_to_nvfp4: $CAST_MXFP4_TO_NVFP4"
    echo "vlm: $VLM"
    echo "calib_with_images: $CALIB_WITH_IMAGES"
    echo "================="
}