Releases: NVIDIA/TensorRT-LLM
v1.3.0rc13
Highlights
- Model Support
- Support and initial optimizations for Nemotron 3 Nano Omni; known issues with audio-from-video and chunked prefill for video are being actively worked on
- Add audio extraction from video, optimize ViT attention, and reduce initialization memory for Nemotron and Nemotron Nano VL models (#12921, #12911, #13283)
- Add per-model VisualGen example scripts, shared configs, per-model defaults, and metadata updates (#12992, #12862)
- Add GLM-4.7 and GLM-5 tool parser support (#13150)
- Optimize Nemotron-H execution from the Python layer and preserve Nemotron HF mamba cache dtype during bench tuning (#13032, #12826)
- Improve DeepSeek-V3.2 and DeepSeek-V3-Lite support with targeted perf and chunked-prefill fixes on Blackwell and SM100-class GPUs (#13142, #13257)
- API
- Fix the chunked prefill API contract for Nemotron Nano VL (#13025)
- Add abort and resume support for Async RL in verl (#12272)
- Add a modular logger with automatic module detection and per-module filtering (#13202)
- Improve prompt handling by accounting for existing multimodal placeholder tokens in text prompts (#12827)
- Propagate real server-side failures to disaggregated serving clients and improve empty-file handling in trtllm-bench (#13119, #12552)
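The per-module filtering mentioned above can be sketched with Python's stdlib `logging` hierarchy; the `trtllm.*` logger names below are illustrative, not the actual TensorRT-LLM logger API.

```python
import logging

def get_module_logger(name: str) -> logging.Logger:
    # Hierarchical names give per-module control: "trtllm.kv_cache", etc.
    return logging.getLogger(f"trtllm.{name}")

# Global default: only warnings and above.
logging.getLogger("trtllm").setLevel(logging.WARNING)
# Per-module override: verbose output for one subsystem only.
logging.getLogger("trtllm.kv_cache").setLevel(logging.DEBUG)

kv_log = get_module_logger("kv_cache")
sched_log = get_module_logger("scheduler")

assert kv_log.isEnabledFor(logging.DEBUG)       # explicitly opted in
assert not sched_log.isEnabledFor(logging.DEBUG)  # inherits WARNING from parent
```

Loggers without an explicit level inherit the nearest ancestor's level, which is what makes per-module overrides cheap.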
- Feature
- Add VisualGen Cache-DiT and a unified cache accelerator (#12548)
- Expand kernel support with broader RMSNorm coverage, optimized causal-conv1d prefill and decode, FP4 residual quantization, and refreshed SageAttention kernels (#13033, #13103, #13117, #12937)
- Add batched addSequence with two-phase claim and unified VSWA and non-reuse support (#13029)
- Add sparse MQA and GQA attention support and introduce new sharding infrastructure (#12470, #12419)
- Improve serving performance with async media loading, faster video frame decoding, cached text computation reuse, lower custom-op overhead, padding-aware CUDA graph tuning, and reduced single-rank broadcast overhead (#13034, #12677, #13149, #12895, #13412, #13259, #11640)
- Optimize runtime internals with Minimax RMSNorm tuning, consolidated prefix-reuse analysis, gen-only sync transfer v2, DWDP contention config cleanup, and round-robin CP cache transmission (#12163, #13095, #12882, #12974, #13180)
- Restore EAGLE3 dynamic-tree speculative decoding support and centralize perfect-router integration and validation (#13081, #13250)
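The "two-phase claim" in the batched `addSequence` change above follows a general pattern: atomically reserve capacity for a whole batch first, then commit or roll back, so a batch is never partially admitted. A generic sketch of that pattern (not the actual KV cache manager code):

```python
import threading

class SlotPool:
    """Two-phase allocation: claim() reserves slots atomically for a whole
    batch; commit() finalizes, rollback() returns them on failure."""

    def __init__(self, capacity: int):
        self._lock = threading.Lock()
        self._free = capacity
        self._claimed = 0

    def claim(self, n: int) -> bool:
        # Phase 1: reserve n slots, or fail with no side effects.
        with self._lock:
            if self._free < n:
                return False
            self._free -= n
            self._claimed += n
            return True

    def commit(self, n: int) -> None:
        # Phase 2 (success): the reservation becomes a real allocation.
        with self._lock:
            self._claimed -= n

    def rollback(self, n: int) -> None:
        # Phase 2 (failure): return the reserved slots to the pool.
        with self._lock:
            self._claimed -= n
            self._free += n

pool = SlotPool(capacity=4)
assert pool.claim(3)      # batch of 3 sequences reserved
assert not pool.claim(2)  # only 1 slot left, second batch rejected whole
pool.commit(3)
```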
- Fix
- Fix KV cache and scheduler correctness issues, including SWA compatibility, token accounting with context chunking, over-allocation in VSWA plus EAGLE flows, KVCacheManagerV2 bugs, and multimodal and disaggregated cache reuse problems (#12968, #12976, #12855, #12306, #13104, #12472)
- Fix runtime stability issues by preventing benchmark fill-loop hangs, tightening warmup reservation behavior, and making host-memory-based prefetch decisions consistent across ranks (#13065, #13078, #13161)
- Fix EAGLE3 LoRA speculative decoding and preserve speculative layer weights to avoid MTP plus PP hangs (#13005, #12555)
- Fix FMHA and attention runtime issues, including SM90 full-mask skip-softmax dispatch, misleading generation warnings, stale CUDA graphs on beam-width changes, and FlashInfer KV layout handling (#13120, #13157, #13255, #13190)
- Fix vision and multimodal correctness issues, including KV-cache quantization leaks into the vision encoder, FLUX high-resolution scheduler off-by-one behavior, and Super V3 multi-stream MoE instability (#13181, #13091, #13122)
- Fix packaging and environment issues by restoring the missing aarch64 library, enforcing NCCL >= 2.28 at configure time, and using weights_only=True in LoRA manager loads (#13206, #13108, #13391)
- Fix operational reliability issues in CI and perf pipelines, including OpenSearch upload failures, hanging AIPerf metrics, SLURM host name propagation, and SLURM submission retry behavior (#13215, #13314, #13367, #12778)
- Fix additional model and runtime issues for Qwen3 mrope cache handling, DSA illegal memory access with CUDA graph plus host KV offload, stale tokenizer alias imports, and WAN example timing conflicts (#13269, #13124, #13086, #13193, #12128)
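One of the fixes above switches LoRA manager loads to `torch.load(..., weights_only=True)`, which restricts unpickling so that arbitrary code smuggled into a checkpoint cannot run. The underlying restricted-unpickler idea can be shown with the stdlib alone, independent of torch (the allow-list here is illustrative):

```python
import collections
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    # Only an explicit allow-list of globals may be resolved; anything else
    # (e.g. a callable smuggled in via __reduce__) is rejected.
    ALLOWED = {("builtins", "dict"), ("builtins", "list")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

# Plain containers and numbers need no globals, so they load fine.
payload = pickle.dumps({"rank": 8, "alpha": 16.0})
safe = RestrictedUnpickler(io.BytesIO(payload)).load()
assert safe == {"rank": 8, "alpha": 16.0}

# A pickle that references a class outside the allow-list is rejected.
evil = pickle.dumps(collections.OrderedDict())
try:
    RestrictedUnpickler(io.BytesIO(evil)).load()
    raise AssertionError("should have been blocked")
except pickle.UnpicklingError:
    pass
```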
- Documentation
- Test & Infra
- Add Dynamo API compatibility tests, VisualGen regression coverage, and refactor MoE communication tests (#12970, #13372, #12841)
- Expand CI coverage for disaggregated serving and weekly performance suites, including K2.5 EPLB coverage, refreshed Nemotron datasets, and additional weekly perf models (#13185, #12982, #13325)
- Improve CI signal quality by splitting multimodal DGX_B200 jobs, removing obsolete or low-priority cases, dropping non-key-model L0 coverage, and moving bf16 and auto precision variants to post-merge (#12978, #13262, #13374, #13315, #13366)
- Improve CI tooling with PR-aware failure analysis, SwiftStack upload support, wildcard bot stage commands, a sync_qa_tests Jenkins script, doc tests, and markdown-only doc-build rules (#12849, #13291, #12881, #13028, #13152, #13358, #13441)
- Refresh repository ownership and security plumbing with CODEOWNERS updates, HMAC key enforcement, and container vulnerability fixes (#13110, #13213, #9850, #13447)
What's Changed
- [https://nvbugs/5997092][fix] Remove waives for DS-V3.2/R1 FP4 Blackwell perf tests by @peihu-nv in #13042
- [None][infra] Waive 2 failed cases for main in post-merge by @xinhe-nv in #13105
- [TRTLLM-9132][infra] Update to ignore failure for release check and building images by @EmmaQiaoCh in #9871
- [https://nvbugs/5626259][fix] Enable nemotron-h chunk prefill test by @Wanli-Jiang in #12980
- [None][feat] Add the invocation path for mamba2 mtp custom op by @JadoTu in #12787
- [None][infra] Waive 4 failed cases for main in post-merge 2654 by @ZhanruiSunCh in #13113
- [None][infra] Waive 3 failed cases for main in post-merge 2658 by @ZhanruiSunCh in #13141
- [None][chore] Add CODEOWNERS mappings for @NVIDIA/trt-llm-multimodal-devs by @venkywonka in #13110
- [None][chore] Add disaggregated tests that timeout to waives.txt by @2ez4bz in #13136
- [https://nvbugs/5844149][fix] Fix issues with DSV3.2 perf tests by @chenfeiz0326 in #13142
- [None][fix] Fix a capacity issue in KVCacheManagerV2 for SWA compatibility by @heyuhhh in #12968
- [https://nvbugs/6044213][chore] unwaive and reduce free mem ratio in AutoDeploy's perf test: deepseek_r1_distill_qwen_32b by @MrGeva in #12965
- [None][fix] Fix chunked prefill API contract for nemotron nano VL by @2ez4bz in #13025
- [TRTLLM-11794][feat] Optimize ViT Attention kernel on Nemotron by @yechank-nvidia in #12911
- [TRTLLMINF-38][feat] Pass PR number to CI failure analysis agent by @dpitman-nvda in #12849
- [https://nvbugs/6074784][chore] Temp waive dis-agg transformers failed tests by @Shixiaowei02 in #13145
- [None][fix] Fix scheduler off-by-one in FLUX pipelines at high resolutions by @karljang in #13091
- [None][infra] Add 5 users to blossom-ci allowlist by @yuanjingx87 in #13146
- [TRTLLM-11403][feat] VisualGen Cache-DiT + unified cache accelerator by @o-stoner in #12548
- [None][fix] Enable LoRA in EAGLE3 speculative decoding by @Funatiq in #13005
- [TRTLLM-11903][test] Add API compatibility tests for dynamo by @brb-nv in #12970
- [None][feat] Update rms_norm + fp4_quant kernel supporting more dim by @Wanli-Jiang in #13033
- [None][chore] Bump version to 1.3.0rc13 by @VALLIS-NERIA in #13159
- [None][fix] Fix compute token accounting for KV cache reuse with context chunking by @lancelly in #12976
- [None][feat] Batch addSequence with two-phase claim and unified VSWA/non-reuse support by @liji-nv in #13029
- [None][bug] fix SM90 full-mask skip-softmax dispatch by @bobboli in #13120
- [None][test] Refactor MoE comm tests: unified dispatch+combine pipeline by @xxi-nv in #12841
- [https://nvbugs/5983320][fix] Use encoder_max_batch_size of 1 for LLaVa in test_multi_request_batch_chat by @moraxu in #12647
- [TRTLLM-11771][feat] Add audio extraction from video for Nemotron Nano VL by @2ez4bz in #12921
- [None][fix] Update stale TOKENIZER_ALIASES import path in serve and bench modules by @cascade812 in #13086
- [TRTLLM-11695][feat] Add per-model VisualGen example scripts, shared configs, and per-model defaults by @zhenhuaw-me in #12992
- [https://nvbugs/6060119][chore] Unwaive DSR1 FP4 128k8k disagg perf tests by @peihu-nv in #13088
- [None][feat] Support sparse mqa/gqa attention by @heyuhh...
v1.3.0rc12
Highlights
- Model Support
- Add LTX-2 two-stage pipeline support (#12361)
- Add CUDA graph support for LTX-2 with `torch.compile` compatibility (#12653)
- Add video temporal compression for Nemotron Nano and RADIO (#12649)
- Extend the Python cache transceiver to support Qwen-Next (#12772)
- Add CuteDSL MoE backend support for Qwen3.5 (#12799)
- Fix LoRA support for Qwen3 models (#12785)
- Support loading FP8 LoRA weight files (#12848)
- Add support for speculative decoding with LoRA (#12661)
- Fix OOM with large numbers of LoRA adapters (#12815)
- Partially fix LoRA overallocation for Nemotron NAS (#12817)
- Skip `inference_mode()` when `torch.compile=True` for Gemma3 FP8 (#12367)
- Skip NVFP4 fused norm when the dimension does not meet requirements (#12901)
- Update MoE `hidden_size` in the communicator for Nemotron-H (#12890)
- Unify image-as-tensor handling to avoid repeated conversions for nano models (#12994)
- API
- Refine the VisualGen API structure (#12807)
- Convert `VisualGenParams` to Pydantic with request validation, per-model defaults, and `extra_params` support (#12922)
- Align `AttentionPlugin` with the EdgeLLM interface (#12233)
- Add shorthand `KVConnector` paths for `lmcache` and `kvbm` (#12626)
- Add the missing `allow_partial_loading` parameter to CuteDSL and ConfigurableMoE `load_weights` (#12761)
- Improve KV cache statistics monitoring (#12413)
- Feature
- Add NvTelemetry/GXT-compliant usage telemetry (#12384)
- Add production-level Prometheus metrics for iteration stats, config info, token counters, and phase histograms (#12545)
- Add conversation-affinity routing for disaggregated serving (#12526)
- Enable block reuse with the overlap scheduler (#12816)
- Unify VisualGen parallelism (#12509)
- Consolidate piecewise CUDA graph VLM updates (#12852)
- Add tunable NVFP4 quantization with an additional FlashInfer backend (#12126)
- Optimize GDN prefill with indexed in-kernel state updates (#12791)
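Conversation-affinity routing, as in the disaggregated-serving feature above, boils down to sticky hashing: every turn of a conversation lands on the same worker so its KV cache can be reused. A minimal sketch under that assumption (not the actual router):

```python
import hashlib

def route(conversation_id, workers):
    """Map a conversation deterministically to one worker."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]

workers = ["gen0", "gen1", "gen2"]
first = route("conv-42", workers)
# Stickiness: repeated turns of one conversation map to the same worker.
assert all(route("conv-42", workers) == first for _ in range(10))
```

A production router would additionally rebalance when workers join or leave (e.g. via consistent hashing) rather than mod by the worker count.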
- Fix
- Propagate `disaggregated_params` through `PostprocWorker` (#12513)
- Prebuild disaggregated context responses to avoid `ctx_request_id` races (#12466)
- Generate HMAC keys for MGMN IPC servers in disaggregated serving (#12670)
- Enable HMAC authentication in VisualGen ZMQ IPC channels (#12680)
- Fix disaggregated gen-only hangs caused by blocking KV transfers (#12640)
- Replace busy-poll sleep in `get_async_noblock` with the ZMQ async poller (#12189)
- Make `trust_remote_code` opt-in in `MultimodalModelRunner` (#12669)
- Fix VLM guided decoding startup crashes caused by missing `vocab_size_padded` (#12284)
- Eliminate double PNG encoding in visual generation serving (#12903)
- Treat whitespace-only content correctly in nano-v3 reasoning swap (#12912)
- Clamp `usedNumBlocks` to non-negative values in KV cache statistics (#11922)
- Fix `moe_chunking_tokens` handling during MoE A2A (#12929)
- Guard CUDA event `elapsed_time` in `perf_metrics_manager` to prevent executor crashes (#12868)
- Remove leftover `onboardBlocks` parameters in `kvCacheManagerTest` (#13107)
- Add CUDA device setup before `load_remote_agent` (#12619)
- Fix Mooncake transfer agent binding (#12723)
- Fix `multi_stream_moe` accuracy with MLIR and piecewise CUDA graphs (#12847)
- Fix Nano chunked prefill (#12782)
- Fix constrained decoding for GLM5 (#12869)
- Fix benchmark disaggregated deadlocks by removing a blocking fill loop (#12208)
- Update CUTLASS C++ to 4.4.2 (#12897)
- Pin Ray to 2.54.1 (#13071)
- Propagate
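The HMAC fixes above authenticate messages on IPC channels so an attacker on the socket cannot inject commands. The general shape, using only the stdlib `hmac` module (the tag-plus-payload framing is illustrative):

```python
import hashlib
import hmac
import secrets

KEY = secrets.token_bytes(32)  # generated once, shared by both endpoints

def seal(payload: bytes, key: bytes = KEY) -> bytes:
    # Prepend a 32-byte SHA-256 HMAC tag to the message.
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload

def open_sealed(message: bytes, key: bytes = KEY) -> bytes:
    tag, payload = message[:32], message[32:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids timing side channels on the tag check.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return payload

msg = seal(b'{"request_id": 7}')
assert open_sealed(msg) == b'{"request_id": 7}'
```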
- Documentation
- Benchmark
- Optimize the Qwen3.5 decode delta kernel (#12740)
- Reduce host overhead in DSA MLA attention (#12631)
- Add a host performance regression test suite for PyExecutor (#12148)
- Add benchmark coverage for allreduce backends (#12887)
- Restore DSR1/DSV32/K2 disaggregated performance tests (#12688)
- Support NV SA benchmarks in CI performance testing (#13004)
- Add K2.5 performance tests into CI (#12931)
- Test & Infra
- Update Perf Sanity System code paths (#12430)
- Bump etcd to 3.6.9 to pick up the gRPC fix (#12594)
- Fix the PLC nightly pipeline and expose more pipeline data (#12940)
- Exclude QA nodes when running TRTLLM CI (#13102)
- Add a unit test for lifecycle race condition errors in disaggregated serving (#12803)
- Add an end-to-end test for PP + disagg + block reuse + chunked prefill hangs (#12913)
- Add Nemotron-3-Super-120B-A12B-NVFP4 functional and performance cases on DGX Spark (#12830)
- Remove obsolete RTX-6000 OOM tests (#12800)
- Remove unused tests (#12625)
- Check unused fixtures (#12730)
- Fix Qwen3 skip-softmax attention CI tests (#12789)
- Fix failing KV cache transceiver tests from the perf sanity changes (#12554)
- Fix Wan unit tests (#13026)
- Remove obsolete waivers (#12979)
- Move the `PY312-UB2404` sanity check test to A100X nodes (#13077)
- Pin Ray to 2.54.1 in the Slurm CI stage (#13085)
What's Changed
- [None][test] Unwaive Nemotron H flaky case by @nv-guomingz in #11236
- [https://nvbugs/5997543][fix] unwaive test_disaggregated_overlap_transceiver_runtime_python by @chuangz0 in #12580
- [TRTLLM-11574][feat] Some updates on Perf Sanity System codes by @chenfeiz0326 in #12430
- [None][doc] add attention developer guide by @QiJune in #12693
- [https://nvbugs/5991957][fix] Propagate disaggregated_params through PostprocWorker by @peihu-nv in #12513
- [https://nvbugs/5883590][fix] Generate HMAC key for MGMN IPC server in disaggregated serving by @yibinl-nvidia in #12670
- [https://nvbugs/5941242][fix] Fix SigLIP test failure by @tijyojwad in #12717
- [None][feat] Optimize qwen3.5 decode delta kernel by @nv-guomingz in #12740
- [https://nvbugs/5961736][fix] Prebuild disagg ctx response to avoid ctx_request_id race by @peihu-nv in #12466
- [https://nvbugs/5922880][fix] Enable HMAC authentication in VisualGen ZMQ IPC channels by @yibinl-nvidia in #12680
- [None][fix] Add missing allow_partial_loading param to CuteDSL and ConfigurableMoE load_weights by @qiaoxj07 in #12761
- [None][chore] Waive hanging Nemotron Super test by @brb-nv in #12821
- [None][fix] add cuda set device before load_remote_agent by @chuangz0 in #12619
- [None][chore] Remove closed bugs by @xinhe-nv in #12766
- [None][test] Remove RTX-6000 OOM test cases by @yufeiwu-nv in #12800
- [None][fix] Fix LoRA support for Qwen3 models by @achartier in #12785
- [TRTLLM-11343][feat] LTX-2 Two Stage pipeline support by @yibinl-nvidia in #12361
- [#12808][feat] AutoDeploy: Add Gemma4 Support by @bmarimuthu-nv in #12710
- [None][feat] Add Claude Code agents and skills for kernel dev, perf analysis, and compilation by @kaiyux in #12831
- [#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats by @wojciech-wais in #11922
- [https://nvbugs/6029864][fix] Fix flaky ray test failure by @brb-nv in #12697
- [https://nvbugs/5813192][fix] Make trust_remote_code opt-in in MultimodalModelRunner by @yibinl-nvidia in #12669
- [None][infra] Bump etcd to 3.6.9 to involve grpc fix by @yuanjingx87 in #12594
- [https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters by @brb-nv in #12815
- [None][feat] AutoDeploy: Add the Triton kernel for MLA by @nvchenghaoz in #12664
- [None][fix] replace busy-poll sleep in get_async_noblock with zmq async poller by @edenfunf in #12189
- [https://nvbugs/6018647][test] Add unit test for Lifecycle Race Condition error in disagg sever by @yingguo-trt in #12803
- [None][infra] Add DSR1 DSV32 K2 Disagg Perf Tests Back by @chenfeiz0326 in #12688
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12765
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12814
- [None][fix] Fix VLM guided decoding startup crash due to missing vocab_size_padded property by @stefanpantic in #12284
- [None][fix] Fix Nano chunked prefill by @2ez4bz in #12782
- [https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_i… by @liji-nv in #12659
- [None][test] remove unused tests by @xinhe-nv in #12625
- [https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory by @peihu-nv in https://github.com/NVIDI...
v1.2.1
v1.3.0rc11
Highlights
- Model Support
- API
- Support include_stop_token_in_output in gRPC request manager (#12517)
- Add deprecation warnings on TRT backend entrypoints (#11723)
- Accept strict field in tools and store field in chat requests (#12482)
- Mark TRTLLMSampler as deprecated and update documentation (#11938)
- Move VisualGen APIs to a separate directory (#12538)
- Remove some fields with redefined defaults (#11671)
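Deprecation warnings on entrypoints, as in the TRT backend change above, are conventionally emitted with `warnings.warn(..., DeprecationWarning, stacklevel=2)` so the warning points at the caller. A generic decorator sketch (the function names are hypothetical):

```python
import functools
import warnings

def deprecated(replacement: str):
    """Mark a legacy entrypoint as deprecated, naming its replacement."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            warnings.warn(
                f"{fn.__name__} is deprecated; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return fn(*args, **kwargs)
        return inner
    return wrap

@deprecated("new_build")
def legacy_build(x):
    return x * 2

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert legacy_build(3) == 6  # still works, but warns
assert caught and issubclass(caught[0].category, DeprecationWarning)
```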
- Feature
- Apply norm before FC in Eagle (#12561)
- Split MLA DSA custom op for piecewise CUDA graph capture (#12503)
- Optimize host performance for Python cache transceiver (#12273)
- Add Mamba2 MTP SSM cache CUDA kernel for tree-based speculative decoding (#12537)
- Add serve-config-guide skill for basic aggregate single-node serving configs (#12054)
- Add FORCE_CHUNK context chunking policy (#12483)
- Add dense GEMM backend for MoE (#10479)
- Implement gen-first disaggregated scheduling, part 2 (#12239)
- Support EPLB with various MoE backends for Nemotron-H models (#12280)
- Skip softmax via sparsity ratio (#11995)
- Add DWDP (distributed weight data parallelism) support for MoE inference (#12136)
- Add AutoDeploy Super V3 MTP support (#12326)
- Introduce fast path (token IDs + multimodal) for VLMs without re-tokenizing encoded prompts (#11708)
- Add global pool support for suffix automaton speculative decoding (#12130)
- Add Triton paged attention for AutoDeploy (#12642)
- Refactor VisualGen attention backend (#12663)
- Add support of linear attention state for C++ KV cache manager (#12531)
- Add temporally-correlated heuristic-guided indexer TopK for sparse attention (#12385)
- Support MLA generation in TrtllmGen attention backend (#12606)
- Extend Python cache transceiver to support Nemotron (#12150)
- Handle different chat template types (#12336)
- Add multi-turn support for trtllm-bench (#12468)
- Add fused DiT QK Norm + RoPE CUDA kernel for FLUX (#11869)
- Support cache reuse for SSM in KVCacheManagerV2 (#12644)
- Add MLIR-based auto-generated elementwise fusion for AutoDeploy (#12427)
- Add --custom_tokenizer CLI option to trtllm-bench (#12586)
- Support LoRA adapter for Nemotron-H models (#12154)
- Apply multiple host performance optimizations for DSA (#12581)
- Reuse Triton slicing kernel for GDN prefill transpose (#12737)
- Add Trtllm-gen FMHA JIT support (#12612)
- Retune causalConv1d forward dispatch for variable-length and short sequences (#12739)
- Update configuration to enable NVFP4 (#12776)
- Fuse SiLU+Mul in AutoDeploy transform (#12497)
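Context chunking policies such as the FORCE_CHUNK policy above split a long prompt's prefill into fixed-size pieces that are processed over several scheduler iterations. The splitting itself is simple (this schematic version is not the actual scheduler code):

```python
def chunk_context(prompt_tokens, chunk_size):
    """Split a prompt into prefill chunks; the last chunk may be short."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

tokens = list(range(10))
chunks = chunk_context(tokens, 4)
assert chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# Concatenating the chunks reproduces the original prompt.
assert [t for c in chunks for t in c] == tokens
```

The interesting part of a real policy is deciding *when* to force chunking (e.g. to bound per-iteration prefill work), not the split itself.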
- Fix
- Fix Triton kernels in wheel (#12569)
- Fix DSACacheManager and RocketCacheManager KV cache estimation ignoring num_layers for draft models (#12571)
- Reorder generation_logits to align with final beam search output ordering (#12268)
- Handle CUDA_ERROR_INVALID_VALUE in kv_cache_v2 _is_prop_supported (#12613)
- Fix autotuner OOM for trtllmGen MoE runners at large context length (#12523)
- Always sync sampler_event in update_requests (#12585)
- Avoid counting KV cache uses during warmup for Prometheus KV cache metrics (#12132)
- Fix lost requests (#12348)
- Fix GPTOSS CUTLASS MoE on Hopper NVLink one-sided workspace overflow (#12666)
- Fix Mooncake dynamic load in transfer_agent_binding (#12181)
- Fix disaggregated pipeline-parallel hang (#12528)
- Correct reused block counting in corner case (#12404)
- Clamp block indices to prevent out-of-bounds in DSA with MTP (#12657)
- Synchronize NCCL memory allocation error handling (#12125)
- Adjust prompt logprobs to use the correct prompt token id (#12499)
- Improve NIXL agent import error diagnostics (#12446)
- Fix disaggregated serving hang on block reuse after eviction (#12667)
- Use the first non-None result returned by Hugging Face download workers (#12259)
- Replace assertions with warnings for unsupported logits/logprobs in speculative sampler (#12547)
- Address H20 weights loading OOM for GPTOSS (#11321)
- Improve Harmony parser (delta grouping, reuse report, test coverage) (#12467)
- Fix hang issues on DGX B200 8-GPU PyTorch configurations (#12656)
- Fix disaggregated KV cache router for chat API; add disaggregated benchmark for ai_perf (#12337)
- Fix CUDA event crash with performance metrics (#12639)
- Update Nemotron-H handling for corner cases (#12620)
- Fix KV cache issue (#12673)
- Fix wrong token suppressed with ignore_eos in Torch sampler (#12358)
- Fix GPTOSS chat template for disaggregated tests (#12724)
- Fix top-K logprobs size for pipeline parallelism (#12623)
- Remove clone in FP8 quantization (#12687)
- Fix Qwen2.5 mixed precision accuracy issue (#12609)
- Fix Mamba metadata prefill bubble in chunked prefill serving (#12736)
- Fix outdated README argument for executorExampleDisaggregated.cpp (#12276)
- Documentation
- Add MoE developer guide for fused_moe module (#12534)
- Update supported models to include Kimi K2/K2.5 and GLM-5 (#12654)
- Publish blog post for DWDP (#12725)
- Add visual generation models to supported models page (#12464)
- Clean up latest news and blogs; update overview and highlight visual generation (#12753)
- Update C++ coding guidelines (#12577)
- Test & Infra
- Use shared utility for node labels (#9095)
- Adjust RocketKV test threshold (#12527)
- Enhance performance tests with GPU availability check in test_perf.py (#12535)
- Move AD performance regression tests to AD pre- and post-merge jobs (#12461)
- Remove Model Registry Check from workflows; check runs in pre-commit (#12590)
- Add Ubuntu 24.04 wheel image for SBSA (#12436)
- Pin mypy version due to dependency conflicts (#12650)
- Fix Pyxis error in disaggregated performance test (#12575)
- Skip already-applied patches gracefully in third-party FetchContent (#12550)
- Add container scanning to PLC nightly pipeline (#12549)
- Use JobBuilder to trigger downstream job (#7079)
- Prefer GitHub then GitLab for TOT waive list (#11063)
- Isolate single-GPU Ray orchestrator tests to avoid CI timeouts (#12616)
- Add workaround for trtllm-bench hang and improve robustness (#12655)
- Bump tornado and black in container (#12600)
- Remove OOM test case from L40S test list (#12685)
- Temporarily disable warn_unused_ignores (#12728)
- Add supplemental Ruff lint for legacy files via ruff-legacy hook (#11469)
- Add port conflict retry for disaggregated multi-process tests (#12618)
- Add CI agent failure analysis to L0 merge request pipeline (#12543)
- Fix source code scanning (#12773)
- Remove gpu-shell tool from ad-run-agent (#12418)
- Move to FlexCache in Austin for 5080 nodes (#12615)
What's Changed
- [https://nvbugs/5882636][fix] Fix triton_kernels in wheel by @dongfengy in #12569
- [https://nvbugs/5919796][test] AutoDeploy: unwaive Super V3 autodeploy failure by @galagam in #12556
- [None][test] Waive another flaky test case on Dis-agg serving with Ne… by @nv-guomingz in #12587
- [#11992][fix] Support include_stop_token_in_output in gRPC request manager by @CatherineSue in #12517
- [None][feat] Eagle: Norm before FC by @IzzyPutterman in #12561
- [#10607][fix] moved AD perf regression tests to AD jobs pre and post merge by @MrGeva in #12461
- [None][infra] Waive 1 failed cases for main in post-merge 2626 by @ZhanruiSunCh in #12592
- [TRTLLM-7335] [infra] Use shared utility for node labels by @niukuo in #9095
- [None][infra] Waive 1 failed cases for main in pre-merge 31714 by @ZhanruiSunCh in #12589
- [https://nvbugs/6007197][fix] Adjust RocketKV test threshold by @heyuhhh in #12527
- [None][test] Enhance performance tests by adding GPU availability check in test_perf.py by @yufeiwu-nv in #12535
- [None][infra] Waive 2 failed cases for main in post-merge 2627 by @ZhanruiSunCh in #12605
- [None][fix] Fix DSACacheManager and RocketCacheManager KV cache estimation ignoring num_layers for draft models by @lancelly in #12571
- [None][doc] Add MoE developer guide for fused_moe module by @xxi-nv in #12534
- [None][chore] Remove Model Registry Check from workflows, the check already runs in pre-commit by @tcherckez-nvidia in #12590
- [https://nvbugs/5983390][perf] Split MLA DSA custom op for piecewise CUDA graph capture by @liji-nv in #12503
- [None][fix] Reorder generation_logits to align with final beam search output ordering by @achartier in #12268
- [TRTC-351][chore] Deprecation warnings on TRT backend entrypoints by @venkywonka in #11723
- [TRTLLM-10804][infra] add ubuntu2404 wheel image for SBSA by @niukuo in #12436
- [#12288][feat] Add Mistral 4-small support to AutoDeploy by @bmarimuthu-nv in #12266
- [None][infra] waive failed case for main by @EmmaQiaoCh in #12621...
v1.3.0rc10
Highlights
- Model Support
- API
- Feature
- Add CuTe DSL single-pass multi-CTA cluster top-k (#12354)
- Account for reusable KV cache blocks in micro-batch scheduler capacity scheduling (#11637)
- Add raster-along-M/N support for blockscaled contiguous backbone kernels in CuteDSL MoE (#12079)
- Add stride support for `conv1d` and `fused_sigmoid_gating_delta_rule_update` (#12442)
- Add a safe allgather implementation with chunking (#12174)
- Add dynamic SMEM block routing in MoE (#12456)
- Optimize `mamba_mixer2.py` decode performance (#11843)
- Add PDL support to CuTE DSL top-k kernels (#12506)
- Add FlexKV support (#12512)
- Add a KV cache-aware ADP router for prefix-affinity request routing (#12315)
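A prefix-affinity router like the KV cache-aware ADP router above prefers the worker whose cached sequences share the longest token prefix with the incoming request, maximizing KV cache reuse. A minimal sketch under that assumption (not the actual router):

```python
def shared_prefix_len(a, b):
    """Length of the common leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(request_tokens, cache_by_worker):
    """Choose the worker whose cache best overlaps the request prefix."""
    def best_overlap(seqs):
        return max((shared_prefix_len(request_tokens, s) for s in seqs),
                   default=0)
    return max(cache_by_worker, key=lambda w: best_overlap(cache_by_worker[w]))

caches = {
    "w0": [[1, 2, 3, 4]],       # shares a 4-token prefix with the request
    "w1": [[1, 2, 9], [7, 8]],  # shares at most 2 tokens
}
assert pick_worker([1, 2, 3, 4, 5], caches) == "w0"
```

Real implementations index cached blocks (e.g. with a radix tree) rather than scanning every sequence per request.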
- Fix
- Fix KV token estimation when ADP is enabled (#12099)
- Fix Eagle MLA target with GQA draft support (#12171)
- Fix Qwen 3.5 3D position ID handling (#12114)
- Switch tests to `TorchSampler` and fix related bugs (#12200)
- Use `ceil_div` for head and size sharding (#12441)
- Remove redundant D2H synchronization to improve performance (#12445)
- Fix parallel WAN VAE when `return_dict=True` (#12460)
- Fix Triton resmooth kernel crashes on SM100f for large MoE grids (#12397)
- Use a model-level warmup cache key for visual generation pipelines (#12516)
- Add NVTX annotations in `sampler.py` (#12459)
- Use `extra_visual_gen_options` to improve visual generation routing (#12487)
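The `ceil_div` fix above matters because truncating division silently drops remainder elements when sharding heads across ranks; rounding up guarantees coverage. The distinction in a few lines:

```python
def ceil_div(a: int, b: int) -> int:
    # Round up without floats: equivalent to math.ceil(a / b) for ints.
    return -(-a // b)

num_heads, tp_size = 10, 4
assert num_heads // tp_size == 2          # floor: 4 ranks x 2 = 8, two heads lost
assert ceil_div(num_heads, tp_size) == 3  # ceil: 4 ranks x 3 = 12 slots, all heads covered
```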
- Documentation
- Test & Infra
- Save unittest subtest results periodically (#11850)
- Fix the B200 aggregated CI perf test MPI issue (#12347)
- Fix LoRA config handling when the provided config count is below requirements (#12409)
- Add a unit test for `load_state_dict` safetensors fallback (#12408)
- Replace the skipped TRTLLM NVFP4 test in the B300 CI list (#12454)
- Fix the ltx-2 model checkpoint issue in VBench eval tests (#12463)
- Fix the concurrent write issue in perf tests (#12484)
- Update dependencies to align with the NGC PyTorch 26.02 stack (#12102)
- Consolidate PyTransceiver code (#12342)
- Add Eagle coverage with different input/output cases on Spark (#12520)
What's Changed
- [None][infra] Waive 4 failed cases for main in post-merge 2611 by @ZhanruiSunCh in #12433
- [None][test] Fix lora config less than required config number by @yufeiwu-nv in #12409
- [https://nvbugs/5916151][fix] Unwaive test_fused_moe_w4a8_nvfp4_fp8[TRTLLM] by @xxi-nv in #12400
- [https://nvbugs/5963423][fix] Fix kv token estimation when ADP is on. by @dominicshanshan in #12099
- [TRTLLM-11229][infra] Save unittest subtest results periodically by @yiqingy0 in #11850
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12426
- [https://nvbugs/5997090][fix] Fix B200 Aggregated CI Perf Test MPI Issue by @chenfeiz0326 in #12347
- [TRTLLM-10407][perf] Add cute dsl single pass multi cta cluster topk by @limin2021 in #12354
- [TRTLLM-11070][feat] Account for reusable KV cache blocks in micro batch scheduler capacity scheduling. by @SimengLiu-nv in #11637
- [None][chore] Fixing guardword check by @pcastonguay in #12455
- [None][infra] Waive 1 failed cases for main in post-merge 2610 by @ZhanruiSunCh in #12434
- [None][feat] CuteDSL MOE: Add raster along M/N support for blockscaled contiguous backbone kernel by @liyuhannnnn in #12079
- [None][fix] Switch tests to TorchSampler and fix bugs by @Funatiq in #12200
- [TRTLLM-10061][fix] Use ceil_div for head/size calculations by @VALLIS-NERIA in #12441
- [TRTLLM-10061][feat] Add stride support for conv1d and fused_sigmoid_gating_delta_rule_update by @VALLIS-NERIA in #12442
- [None][fix] Eagle: MLA Target + GQA Draft by @IzzyPutterman in #12171
- [None][doc] fix outdated code references in tech blogs 2, 3, 4, 8, 9, 11 by @schetlur-nv in #12338
- [TRTLLM-11471][feat] Add safe version of allgather with chunking by @chienchunhung in #12174
- [None][perf] add Dynamic SMEM block routing in MOE by @jiahanc in #12456
- [TRTLLM-11544][feat] Add Qwen 3.5 supporting(NVFP4). by @nv-guomingz in #12302
- [https://nvbugs/5997090][fix] Add Disagg Perf Test back as MPI Issue has been fixed by @chenfeiz0326 in #12458
- [https://nvbugs/5841976][fix] Remove test_fused_moe_alltoall_fp4[DeepEP] from waives by @xxi-nv in #12405
- [None][infra] Waive 2 failed cases for main in post-merge 2613 by @ZhanruiSunCh in #12473
- [https://nvbugs/5866619][test] Add unit test for load_state_dict safetensors fallback by @crazydemo in #12408
- [None][feat] Fuse all_reduce with norm for nemotron_h models by @Wanli-Jiang in #12410
- [None][infra] Update CI allowed list by @yuanjingx87 in #12488
- [https://nvbugs/6013562][test] Update waive by @xinhe-nv in #12492
- [None][feat] Small optimizations for mamba_mixer2.py decode by @hnover-nv in #11843
- [None][infra] Waive flaky DeepSeekV3Lite disagg serving test by @hyukn in #12494
- [#11526][chore] AutoDeploy accuracy tests: Use Llama3.1-8B-Instruct official checkpoints by @galagam in #12285
- [https://nvbugs/6007285][fix] Replace skipped TRTLLM NVFP4 test in B300 CI list by @xxi-nv in #12454
- [https://nvbugs/5983390][fix] Remove redundant D2H sync to optimize perf by @hyukn in #12445
- [https://nvbugs/5987470][fix] BREAKING: Do not normalize log probs by default by @achartier in #12366
- [TRTLLM-11622][fix] fix parallel WAN vae when return_dict=True by @NVShreyas in #12460
- [None][infra] Waive pre-merge failed 5090 test by @yuanjingx87 in #12486
- [None][infra] Waive flaky DeepSeekV3Lite disagg serving test by @bo-nv in #12518
- [None][chore] Fix ltx-2 Model Checkpoint Issue in VBench Eval Tests by @yibinl-nvidia in #12463
- [https://nvbugs/5962591][fix] Fix Triton resmooth kernel crash on SM100f for large MoE grids by @Barry-Delaney in #12397
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12495
- [None][doc] Document temperature-adjusted logprobs in TRT backend by @achartier in #12514
- [None][feat] Add PDL support to CuTE DSL top-k kernels by @limin2021 in #12506
- [None][infra] Waive 4 failed cases for main in post-merge 2617 by @ZhanruiSunCh in #12536
- [None][doc] Update Python coding guidelines. by @hnover-nv in #12439
- [#12290][fix] Qwen 3.5 fix 3d position ID handling by @bmarimuthu-nv in #12114
- [TRTLLM-10820][infra] Update dependencies to align with NGC PyTorch 26.02 stack by @EmmaQiaoCh in #12102
- [https://nvbugs/6015329][fix] Use model-level warmup cache key for visual gen pipelines by @karljang in #12516
- [TRTLLM-9523][chore] PyTransceiver code consolidation by @Shixiaowei02 in #12342
- [None][test] Add different input-output of eagle cases on Spark by @JennyLiu-nv in #12520
- [https://nvbugs/6011086][fix] Fix Perf Test's Concurrent Write Issue by @chenfeiz0326 in #12484
- [None][fix] NVTX annotation in sampler.py by @ixlmar in #12459
- [https://nvbugs/5998489][feat] Adding support for request priority in LLM API by @pcastonguay in #12362
- [None][feat] Add support for FlexKV by @pcastonguay in #12512
- [None][feat] KV cache-aware ADP router for prefix-affinity request routing by @lancelly in #12315
- [https://nvbugs/6008183][fix] Use extra_visual_gen_options to help de… by @JunyiXu-nv in https://github.com/NVIDIA/T...
v1.3.0rc9
Highlights
- Model Support
- Add Qwen3-next attention DP support (#10218)
- Improve DeepSeek-V3.2 NVFP4 indexer GEMMs and routing kernels (#11989, #12055)
- Support KV cache and speculative decoding in the Trtllm-Gen attention backend (#11667, #12267)
- Add audio support and chunked-prefix enablement for Nemotron models (#12191, #12414)
- Add GLM 5 support and fix DSA MTP issues (#11990)
- Add initial Qwen3.5 text model support for the PyTorch backend with BF16/FP8 (#12242)
- API
- Add energy metrics to `trtllm-serve` and benchmarking workflows (#11855)
- Expose `video_pruning_rate` in `llm_args` and improve Nano V2 VL handling (#12194)
- Add `TLLM_PROFILE_LOG_RANKS` to control per-rank step logging (#12263)
- Improve the serve CLI with renamed flags and `mm_embedding_serve` enhancements (#12105)
- Add an `auto` option for tool and reasoning parsers (#12104)
- Support interleaved thinking in `trtllm-serve` (#12199)
- BREAKING: Set the default KV cache transfer timeout to 60 seconds (#12249)
- Feature
- Add FP8 combine support in `moe_a2a` (#11844)
- Add batch generation support to visual generation pipelines (#12121)
- Improve request management in the sampler (#11861)
- Add fused AllReduce + RMSNorm with optional residual support (#12201)
- Add constraint-based memory partitioning and a Python scheduler for `KVCacheManagerV2` (#12212, #11939)
- Add LM head sharding (#12252)
- Add an interactive recipe selector with curated configs and button-grid UI (#11917)
- Improve DSA and FlashMLA performance with new kernel fusions and cached tile-scheduler metadata (#12322, #12161)
- Improve model performance with CuteDSL `indexer_top_k`, FlashInfer MLP activation, and refined KV cache buffer sizing (#12236, #12131, #12274)
- Fix
- Fix disaggregated perf test result generation, env export, and port allocation issues (#12211, #12140)
- Fix harmony and tool-calling parsers for agentic coding use cases (#12045)
- Fix torch.compile compatibility by routing DSA attention through the MLA custom op (#12186)
- Fix `min_tokens` handling for long prompts and return explicit scheduling errors when requests cannot be placed (#12166, #12206)
- Fix KV cache V2 OOMs and weight-loading OOMs in disaggregated serving (#12188, #12377)
- Fix lost requests, dummy-request crashes, and `GUIDE_TYPE_STRUCTURAL_TAG` handling in request management paths (#12197, #12403, #12330)
- Fix W4A16 AWQ bias handling on SM100 and add bias support to `WeightOnlyQuantLinearMethod` (#12190, #12317)
- Fix MiniMax model loading and multimodal loading error propagation (#12182, #12331)
- Fix MTP/DSA reliability, PARD accuracy, and NVFP4 MoE mixed-precision scales (#12010, #12360, #12240)
- Fix DGX Spark multi-node hangs, cross-node rollout issues in Verl, and `CUDA_VISIBLE_DEVICES` propagation in scripts (#12316, #11924, #12370)
- Fix build and runtime issues for SM103 context-attention kernels, L40s IB transfers, LlavaNext dtype fallback, and MnnvlMemory resource cleanup (#12248, #12152, #12169, #11979)
- Add warmups to avoid AIPerf timeouts and I2V torch.compile recompilation (#12178, #12351)
- Pre-cache aesthetic predictor weights to avoid VBench 429 failures (#12127)
- Documentation
- Test & Infra
- Limit pre-merge pre-commit checks to changed files (#11379)
- Use CPU affinity instead of raw CPU count for default build parallelism (#12167)
- Add broader performance, accuracy, and end-to-end coverage for Nemotron, DeepSeek-V3.2, disaggregated serving, FLUX, and DSA host-cache offload (#12184, #12142, #12275, #12279, #12278, #12153)
- Update multi-node and MPI-related test coverage (#12075, #12300)
- Add SSH key authentication support for SLURM clusters (#12172)
- Use the public PyTorch index as a CI fallback and update the CI allowlist (#12261, #12296)
- Enable type checking for sampler modules and improve Python KV transceiver coverage (#11678, #11574)
- Remove outdated QA coverage and refactor benchmarking and test infrastructure (#12277, #12344, #12124, #11720, #12192)
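To make the `min_tokens` contract fixed above (#12166) concrete, here is a minimal sketch of how a sampler can honor a minimum output length: suppress the end-of-sequence logit until enough tokens have been generated. This is an illustrative simplification, not the TensorRT-LLM sampler code.

```python
def apply_min_tokens(logits, generated_len, min_tokens, eos_id):
    """Suppress the EOS logit until at least `min_tokens` tokens have
    been generated, so sampling cannot terminate early.

    Simplified sketch; a real sampler applies this per-request on tensors."""
    if generated_len < min_tokens:
        logits = list(logits)
        logits[eos_id] = float("-inf")  # EOS can never be sampled
    return logits
```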
What's Changed
- [TRTLLM-10929][feat] add fp8 combine in moe_a2a by @dc3671 in #11844
- [TRTLLM-9767][feat] Enable attention dp for qwen3-next. by @nv-guomingz in #10218
- [None][fix] Fix Disagg Perf Test No result.xml Bug by @chenfeiz0326 in #12211
- [https://nvbugs/5955188][fix] Fix harmony parsers for agentic coding use cases by @dongfengy in #12045
- [https://nvbugs/5973536][fix] Route DSA attention through MLA custom op for torch.compile compatibility by @yizhang-nv in #12186
- [https://nvbugs/5823135][fix] Fix min_tokens not respected when prompt is long by @JunyiXu-nv in #12166
- [None][doc] Blog18 for NVLinkOneSided AlltoAll. by @bobboli in #12195
- [None][chore] Remove closed bugs by @xinhe-nv in #12222
- [None][fix] Fix KV cache V2 OOM with separate draft KV cache (EAGLE3/MTP) by @yizhang-nv in #12188
- [None][doc] AutoDeploy: ad-model-onboard skill updates by @bmarimuthu-nv in #12234
- [TRTLLM-10569][infra] Only check the changed files in pre-commit in pre-merge CI by @yiqingy0 in #11379
- [https://nvbugs/5948878][fix] fix lost requests by @bo-nv in #12197
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12218
- [None][chore] fix deepep trtllm backend MXFP4 by @leslie-fang25 in #12219
- [None][chore] Alltoall benchmark script refine (second time). by @bobboli in #12192
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12220
- [None][fix] Fix W4A16 AWQ bias not applied on SM100 (Blackwell) by @Tracin in #12190
- [None][fix] Export computed env vars to env_vars.json and fix port allocation in disagg benchmark by @qiaoxj07 in #12140
- [TRTLLM-11288][fix] Adapt LTX2 pipeline to CompilationConfig warmup interface by @luyiyun1021 in #12232
- [https://nvbugs/5955927][fix] Add warm up before aiperf to fix timeout issue. by @dominicshanshan in #12178
- [None][refactor] Improve request management in sampler by @Funatiq in #11861
- [None][chore] Use affinity rather than CPU count for default build parallelism by @achartier in #12167
- [None][feat] Support kv cache in Trtllm-Gen attention backend by @yihwang-nv in #11667
- [None][docs] Update nemotron 3 super deployment to include tool calling and reasoning parser by @tijyojwad in #12215
- [None][fix] Add more models to increase perf test coverage by @chenfeiz0326 in #12184
- [TRTLLM-9521][feat] Unfuse indexer.wk from attention GEMM for DS-V3.2 NVFP4 by @peihu-nv in #11989
- [https://nvbugs/5879588][fix] fix MiniMax model loading bugs by @jmydurant in #12182
- [TRTLLM-10333][feat] Add energy metrics in trtllm-serve and benchmark… by @JunyiXu-nv in #11855
- [None][test] Update nemotron super test cases with official ckpt. by @nv-guomingz in #12142
- [None][fix] Reliability fixes for MTP with DSA and support host cache offload for DSA by @dmtri35 in #12010
- [None][infra] Waive 5 failed cases for main in post-merge 2599 by @ZhanruiSunCh in #12283
- [None][infra] use public torch index as CI backup by @tburt-nv in #12261
- [TRTLLM-11362][feat] Add batch generation support to visual gen pipelines by @karljang in #12121
- [https://nvbugs/5973801][fix] exclude subproc_worker_timer from thread leak checks by @MrGeva in #12286
- [#11432][feat] AutoDeploy: Enable fp8 quantization fusion part 1 by @galagam in #11910
- [#10931][feat] AutoDeploy: one-model spec dec by @lucaslie in #11701
- [https://nvbugs/5973536][fix] Add NVFP4+FP8KV+MTP accuracy specs for DeepSeek-V3.2-Exp by @yizhang-nv in #12269
- [#11368][fix] FP4 CUTLASS GEMM shared memory overflow on GB10 (SM121) by @mihai-chiorean in #12141
- [TRTLLM-11267][feat] Add audio support for nemotron by @2ez4bz in #12191
- [None][feat] GLM 5 support and DSA MTP fixes by @NVShreyas in #11990
- [None][fix] Relax MoE test tolerance for fp16 TP mode accuracy mismatch by @xxi-nv in https://github.com/NVIDIA/Tenso...
v1.3.0rc8
Highlights
- Model Support
- Nemotron 3 Super support
- Add tool parser support for GLM-4 models (#11986)
- Implement dynamic resolution for Nemotron VL (#11894)
- Enable mixed quantization support for Nemotron-H Mamba (#11972)
- Add VisualGen FA4 attention backend support (#11697)
- VisualGen support for LTX-2, Wan and FLUX (#12009)
- Add TRTLLM-Gen kernels for GLM4.7 and support `groupsTokensHeadsQ` and `e2m1` output (#11643)
- Support attention-DP for TRTLLM-Gen NVFP4 MoE (#12156)
- API
- Feature
- Add basic SSM support in `KVCacheManagerV2` (#11976)
- Improve KV event batching (#11883)
- Add 2FP4 / Arcquant support (#11333)
- Adapt the transceiver to manager v2 (step 6) (#11978)
- Add shared expert LoRA support for MoE models in the PyTorch backend (#11760)
- Add dynamic draft length on the one-model speculative decoding path (#10860)
- Enable configurable warmup shapes for VisualGen (#12107)
- Add FlashInfer API support for `TRTLLMGenFusedMoE` (#10453)
- Add Python cache transceiver support for gen-first workflow (#11941)
- Fix
- Upgrade Cutlass version (#11956)
- Fix DS v32 tool calling type and parse errors (#11935)
- Fix protobuf and `aiohttp` vulnerabilities (#11898)
- Fix NVFP4 sharding (#11618)
- Fix Kimi-K2.5 accuracy test skip condition and reference configs (#11930)
- Pass `sparse_attn_config` from `effective_draft_config` for one-model draft KV cache (#12032)
- Fix MTP advanced sampling top-k IMA (#12088)
- Revert refactor of the KV connector integration in py_executor, which caused issues with KVBM (#11872)
- Fix sharding overwrite with multiple graph modules (#12051)
- Fix various agentic flow issues (#12061)
- Split `mContextChunkSize` into per-target and per-draft fields (#12058)
- Fix `ValueError` and missing decoding statistics for MTP (#12063)
- Improve NCCL library load stability (#12015)
- Disable TRTLLM-Gen routing PDL due to NaN issues (#11994)
- Enforce a minimum `NVSHMEM_QP_DEPTH` of 128 for DeepEP low latency (#12100)
- Narrow a bare `except` clause and use identity checks for `None` (#12041)
- Fix MoE DeepEP hangs caused by non-deterministic GC (#12060)
- Fix `KVCacheManagerV2` shrink behavior for the last level and improve `init_ratio` (#12112)
- Fix Mamba cache handling for PP > 1 (#12146)
- Handle `anyOf` parameter schemas in the Qwen3Coder tool parser (#12173)
- Add explicit errors for intermediate-size misalignment with the FP8 block size (#12101)
- Fix DeepEP with the TRTLLM MoE backend for sequence length 1 (#12158)
- Improve port retry loops and exception handling (#12225)
- Add streaming support for no `</think>` on Nemotron models (#12176)
- Documentation
- Benchmark
- Add QA perf test cases with L0 local mode (#12022)
- Align performance benchmark output format (#12067)
- Improve sampler performance by replacing `torch.where` with `masked_fill_` (#11949)
- Add a fused `cat+fp8_quantize` CUDA kernel for the DSA indexer (#11899)
- Optimize long-sequence token-parallel prefill for the DSA indexer (#11871)
- Reduce `logprobs=0` overhead in `TorchSampler` (#11983)
- Refine AlltoAll benchmark scripts (#11649)
- Optimize the Q3N decode kernel with IO reads (#11344)
- Fix disaggregated gen-only benchmark coverage (#12091)
- Fix MPI issues and port conflicts in disaggregated performance tests (#12020)
- Add GB200 performance sanity tests to the QA test database (#11882)
- Refactor parallel VAE support (#12123)
- Optimize 6KD FP8 blockscale GEMM (#11502)
- Optimize Qwen3.5 performance (#11581)
- Restore 3 disaggregated gen-only tests (#12159)
- Test & Infra
- Fix disaggregated SKU coverage (#12065)
- Fix upload build info branch handling and ensure it always runs in post steps (#12025)
- Fix the CI issue for Mistral Large3 (#12073)
- Enable more KV connector priority tests in CI (#11892)
- Add speculative decoding tests for `exclude_input_in_output=true` (#12080)
- Add E2E tests for the KV cache connector async loading path (#12053)
- Change the image used for the CI preparation step (#12086)
- Add the `verl` stage in CI (#11306)
- Add multi-node E2E and accuracy cases on DGX-Spark (#12110)
- Update NumPy to version 2 (#11280)
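The KV event batching improvement above (#11883) follows a common pattern: coalesce many small per-block cache events into batches before publishing them, cutting per-event overhead. The sketch below shows the shape of such a batcher; the class and parameter names are hypothetical, not the TensorRT-LLM API.

```python
class KvEventBatcher:
    """Coalesce per-block KV cache events into batches before publishing.

    Simplified sketch, not the TensorRT-LLM class."""

    def __init__(self, publish, max_batch=128):
        self.publish = publish      # callback receiving a list of events
        self.max_batch = max_batch  # flush threshold
        self.pending = []

    def add(self, event):
        self.pending.append(event)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        # Publish whatever is queued, even a partial batch.
        if self.pending:
            self.publish(list(self.pending))
            self.pending.clear()
```

A real implementation would also flush on a timer and on shutdown so events are never held indefinitely.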
What's Changed
- [None][feat] Add Auto-Deploy dashboard failures analysis skill by @tcherckez-nvidia in #12033
- [https://nvbugs/5820511][fix] Upgrade Cutlass version by @pamelap-nvidia in #11956
- [None][feat] Add AD model list validation checks to pre-commit and PR… by @tcherckez-nvidia in #12036
- [None][chore] Clarify DCO sign-off and co-author guidelines in AGENTS.md by @kaiyux in #12034
- [TRTLLM-7784][feat] Basic SSM support in KVCacheManagerV2 by @lowsfer in #11976
- [None][test] Add QA's perf test cases with L0 local mode by @fredricz-20070104 in #12022
- [TRTLLM-11246][feat] Add tool parser support for GLM-4 models by @JunyiXu-nv in #11986
- [https://nvbugs/5937478][fix] Fix DS v32 tool calling type and parse error by @JunyiXu-nv in #11935
- [TRTLLM-11135][fix] Fix vulnerabilities protobuf and aiohttp by @yiqingy0 in #11898
- [None][chore] Align perf benchmark output format by @yingguo-trt in #12067
- [None][chore] Improve sampler performance by replacing torch.where with masked_fill_ by @stnie in #11949
- [None][infra] Waive 1 failed cases for main in post-merge 2582 by @ZhanruiSunCh in #12069
- [TRTLLM-10421][perf] Add fused cat+fp8_quantize CUDA kernel for DSA indexer by @kaiyux in #11899
- [None][test] Fix disagg sku by @fredricz-20070104 in #12065
- [https://nvbugs/5892646][perf] Long-sequence token-parallel optimization for DSA indexer prefill by @nvxuanyuc in #11871
- [TRTLLM-11265][feat] Implement dynamic resolution for Nemotron VL by @2ez4bz in #11894
- [https://nvbugs/5708901][perf] reduce logprobs=0 overhead in TorchSampler by @ixlmar in #11983
- [None][feat] NVFP4 TRTLLM-Gen MoE for AutoDeploy (Nemotron Super) by @tcherckez-nvidia in #11652
- [https://nvbugs/5963896][fix] Remove test `test_visual_gen_quickstart` on A10 by @chang-l in #12048
- [TRTLLM-11535][feat] Fixed NVFP4 sharding by @greg-kwasniewski1 in #11618
- [None][fix] Improve KV Event Batching by @jthomson04 in #11883
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12047
- [TRTLLM-11276][fix] Fix Kimi-K2.5 accuracy test skip condition and reference configs by @lancelly in #11930
- [https://nvbugs/5919026][fix] Pass sparse_attn_config from effective_draft_config for one-model draft KV cache by @chenfeiz0326 in #12032
- [None][fix] MTP Advanced Sampling Topk IMA by @IzzyPutterman in #12088
- [None][fix] Revert "[None][chore] KV Connector Refactor (#11078)" by @jthomson04 in #11872
- [None][chore] Bump version to 1.3.0rc8 by @yuanjingx87 in #12090
- [None][chore] Refine AlltoAll benchmark scripts. by @bobboli in #11649
- [None][feat] 2FP4 / Arcquant. by @Tracin in #11333
- [None][fix] Fix Upload Build Info branch and run in post always by @mzweilz in #12025
- [TRTLLM-11366][feat] Add dedicated virtual memory tag for model weights, configurable restore mode by @tongyuantongyu in #11889
- [https://nvbugs/5961430][fix] Fix CI issue of Mistral Large3 by @byshiue in #12073
- [None][test] add Perf sanity gb200 test into QA test db by @xinhe-nv in #11882
- [None][infra] Waive 2 failed cases for main in post-merge 2584 by @ZhanruiSunCh in #12108
- [None][chore] Waive mpi hang test case by @jieli-matrix in #12077
- [None][chore] re-enable benchmark test in post merge by @zhenhuaw-me in #12035
- [None][feat] Mamba optimization and mixed quantization support for nemotron-h by @Wanli-Jiang in #11972
- [None][fix] Various fixes for agentic flow by @2ez4bz in #12061
- [https://nvbugs/5936322][fix] Fix sporadic port collision in multigpu AutoDeploy tests by @MrGeva in https://githu...
v1.2.0
Highlights
- Model Support
- Added beta support for K-EXAONE, Nemotron Nano V3, Qwen3-Next and Qwen3-VL.
- Improved GPT-OSS, Nemotron, EXAONE, GLM, Starcoder2, Qwen3, KimiK2, DeepSeek v3.2 and Mistral Large 3 support and validation.
- Expanded Blackwell/Hopper/Ampere enablement including B300/GB200/GB300 and SM120/SM121/SM103 paths.
- Broadened low-precision and MoE capabilities (FP8/NVFP4/MXFP4/INT4-AWQ), including routing and kernels.
- Features
- Speculative Decoding:
- Enabled MTP>1 support for DeepSeek v3.2
- Disaggregated Serving:
- Added service discovery mechanism for dynamic scaling
- Added support for cancelling requests
- Added NIXL-LibFabric support
- Added support for Mooncake transfer engine as a cache transceiver backend
- Sampling:
- Implemented batched sampling using FlashInfer sampling
- Added support for returning logprobs incrementally with streaming mode in PyTorch backend
- Added Beam Search support to TorchSampler
- Performance:
- Improved TorchSampler performance
- Enabled PDL by default and added PDL support for indexer TopK and additional kernels.
- Improved trtllm-gen kernels
- Enabled early exit with overlap scheduler
- Added NUMA-aware CPU affinity automatic configuration
- Expert Parallelism:
- Enabled EPLB for trtllm-gen and cutlass backend
- Enabled CuteDSL MoE with large EP
- Added CUDA graph support for DeepEP
- Multiple performance improvements
- Hardware:
- DGX Spark Support (Beta)
- Others:
- Helix parallelism support
- New Ray orchestrator type
- Documentation
- Deployment Guides:
- Added comprehensive deployment guides for KimiK2, Qwen3 and Qwen3-Next.
- Added new guide on CPU Affinity configuration.
- Updated GPT-OSS guide.
- Developer Guides:
- Added developer guide about KV Cache Transmission.
- New section on MoE Expert Load Balance Analysis (Perfect Router) in Performance Analysis guide.
- New section on API Change Principles in LLM API Change guide.
- Feature Documentation:
- Created new guides for Additional Outputs, Helix Parallelism, KV Cache Connector, Ray Orchestrator, Sparse Attention and Torch Compile & Piecewise CUDA Graph.
- Also updated the Feature Combination Matrix and Paged Attention, IFB, and Request Scheduling guide.
- Tech Blogs: Published several new tech blogs.
- Examples:
- Added new section on disaggregated serving service discovery method.
- Added examples for K-EXAONE, Nemotron Nano V2 VL and Nemotron Nano V3.
- Added RocketKV usage documentation.
- Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.12-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.12-py3`.
- The dependent public PyTorch version is updated to 2.9.1.
- The dependent transformers version is updated to 4.57.3.
- The dependent triton version is updated to 3.5.1.
- The dependent NIXL version is updated to 0.8.0.
- API Changes
- Breaking Changes:
- FlashInfer sampling now used by default with PyTorch backend.
- Changes to sampling strategy in some previously undefined cases.
- OpenAI API:
- Enabled n > 1 with PyTorch backend
- Added support for GET/DELETE v1/responses
- Fixed multiple issues
- Known Issues
- DGX Spark: DGX Spark support is in beta. Only single-node configurations and the models listed above have been validated in this release.
- Disaggregated Serving: A hang may occur in disaggregated serving with context pipeline parallelism and generation tensor parallelism configurations.
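The incremental streaming logprobs support noted under Sampling can be sketched as a generator that emits only the tokens (and logprobs) added since the previous step, instead of re-sending the cumulative lists each time. The names below are hypothetical, not the PyTorch backend code.

```python
def stream_with_logprobs(step_outputs):
    """Given an iterable of cumulative (tokens, logprobs) snapshots,
    yield only the newly generated (token, logprob) pairs per step.

    Illustrative sketch of incremental logprob streaming."""
    sent = 0
    for tokens, logprobs in step_outputs:
        new = list(zip(tokens[sent:], logprobs[sent:]))
        sent = len(tokens)  # remember how much the client already has
        yield new
```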
v1.3.0rc7
Highlights
- Model Support
- Support tensor parallelism of TRTLLM MoE backend for Nemotron-H model (#11470)
- Add Kimi-K2.5 text model support (NVFP4) (#11777)
- Add Helix CP support for DSV3.2 (#11507)
- Support mix quantization between shared experts and routed experts for DSV3 (#11215)
- Support Cohere Command A model (#11505)
- Extract embeddings as `.safetensors` and support float8-quantized models (#11180)
- API
- Add `--served-model-name` option to the `serve` command (#11711)
- Add flag to `trtllm serve` to override KV cache dtype (#11487)
- Use string stop/bad words in gRPC proto instead of pre-tokenized `TokenSequence` (#11888)
- Support multimodal image input in gRPC server (#11800)
- Expose `use_python_scheduler` in `SchedulerConfig` and add associated tests (#11884)
- Add `max_gpu_total_bytes` to control KVCacheManagerV2 capacity (#11907)
- Feature
- Support PARD (Parallel Draft Model) in one-model speculative decoding (#11438)
- Enable autotuner for VisualGen and compilation config support (#11660)
- Add globaltimer-based timing backend for autotuner profiling (#11657)
- Support heterogeneous `tokens_per_block` (#11751)
- Refactor KVCacheManagerV2 to simplify new model support (#11749)
- Support Helix CP with GQA (#11570)
- Add option to skip KV cache memory estimation (#11714)
- Implement suffix automaton on device for speculative decoding and one-model support (#11434)
- Separate radix search tree implementation (#10862)
- Add support for `expert_number` (≤ 2048) and `K` (≤ 32) (#11510)
- Add support for bidirectional sliding window attention mask to `fmha_v2` (#11212)
- Avoid duplicated computation with ADP + Helix CP in GQA (#11891)
- Add explicit video encode format support (#11830)
- Refactor video encoding to use ffmpeg CLI or pure Python fallback (#11672)
- Integrate CuTe DSL top-k kernel for Blackwell (#11900)
- Integrate suffix automaton with EAGLE3 and PARD (#11878)
- Add 5D A2A for fused Ulysses (#11787)
- Add SiLU to `trtllm-gen` MoE (#11663)
- Optimize by fusing `nvfp4_quant` into `layernorm_gated` for `mamba2_mixer` (#11473)
- Wire `KVCacheBlock` to `UnifiedBlockTree` using lookup-node pointers (#11919)
- Run extra general warmup to warm up the memory pool (#10340)
- Fix
- Add async worker to MTP/EAGLE3 sampler (#11573)
- Fix disaggregated cancellation (#11730)
- Use `prefer_pinned()` in `pard.py` (#11762)
- Release KVCacheManagerV2 memory immediately on shutdown (#11746)
- Remove duplicated MoE computation with Helix CP+DP (#11167)
- Register add+norm fallback pass for `torch.compile` in multi-GPU mode (#11739)
- Propagate logprobs from prefill to decode in disaggregated serving (#11727)
- Propagate logits from prefill to decode in disaggregated serving (#11767)
- Enable separate draft KV cache pool for aggregated mode and KVBM (#11689)
- Fix warnings when building `moe_kernels.cu` (#11703)
- Fix `available_blocks` typo in scheduler (#11801)
- Clean up memory in the rollout process (#11658)
- Warm up `maybe_compiled_cat` in `forward_context_with_chunked_prefill` (#11743)
- Fix DeepEPLowLatency with CuTe DSL MoE backend (#11769)
- Fix FP8 per-tensor `torch.compile` graph break in dynamic quantization (#11759)
- Fix streaming generation logits and speed up the logits test case (#10637)
- Fix overly aggressive capacity scheduler (#11731)
- Use proper tokens when `exclude_input_in_output` is true (#9453)
- Move `launch_dependent_grids` after `tmem` free to fix a race (#11812)
- Fix E/PD disaggregated chunked prefill bug (#11805)
- Fix SM120 issue for `rms_norm` with `nvfp4_quant_fusion` (#11774)
- Remove dead code (#11813)
- Fix KVCacheManagerV2 OOM and dummy request allocation in chunked prefill / pipeline parallel (#11710)
- Fix AttributeError when DSA indexer accesses non-DSA KVCacheManager (#11858)
- Override `mMaxAttentionWindow` with actual largest window size (#11842)
- Update `check_is_moe` to support `mlp_layer_types` after `config.json` update (#11477)
- Fix incorrect GPU timing in time breakdown under overlap scheduler (#11860)
- Fix OOM hang with `NCCL_SYMMETRIC` fallback during long-context inference (#11870)
- Fix position IDs input for Qwen3.5 text-only usage (#11877)
- Disable preload for Llama4 Scout (#11873)
- Fix formatting issue in `tensorrt_llm/serve/openai_server.py` (#11920)
- Prevent RuntimeError from dict mutation during iteration in EXAONE MoE weight mapper (#11862)
- Fix Nemotron MTP crash on SM90 (#11807)
- Fix Mistral Large3 + EAGLE bug (#11942, #11885)
- Fix TeaCache broken caching for FLUX.1 and FLUX.2 (#11868)
- Fix FLUX.1 TeaCache polynomial coefficients and defaults (#12007)
- Implement workaround for `ClientPayloadError` (#12018)
- Fix duplicate model entry in the model list (#12029)
- Fix Python string truthiness bug in FMHA cubin selection (#11909)
- Documentation
- Fix typos, grammar, and accuracy across documentation (#11766)
- Add sparse attention tech blog (#11644)
- Add known issue for disaggregated serving hang with asymmetric PP/TP (#11789)
- Fix documentation links (#11912)
- Replace “TensorRT-LLM” with “TensorRT LLM” (#11914)
- Add CI trigger and test-failure retrieval instructions to `AGENTS.md` (#11803)
- Benchmark
- Vectorize `quantize_fp8_blockwise` with a CUDA kernel (#11724)
- Use `F.rms_norm` for per-head QK normalization in VisualGen (#11798)
- Short-sequence MHA optimization for DSA MLA prefill (#11677)
- Parallel VAE harness and implementation for WAN (#11875)
- Add Triton FP8 blockwise quant kernel and autotuner bucket-skip for VisualGen (#11854)
- Optimize `_prepare_inputs` host time (#11704)
- Improve `are_stop_words` performance (#11196)
- Add DeepSeek RCCA performance test case (#11736)
- Add VisualGen benchmarking script (#11651)
- Test & Infra
- Add tests for all database configs (#11653)
- Move B200 test stage to AIHub (#11692)
- Support local wheel installation and add GB300 demo cases (#11742)
- Remove submodule pulls from TRT-LLM git checkouts (#11693)
- Add back WAN VBench test in CI (#11804)
- Add E2E test for cancelled disaggregated generation requests with overlap scheduler (#11795)
- Pass Nsight options to `ray_executor` and trigger profiling through `collective_rpc` (#11493)
- Add B200 multi-node tests DB (#11783)
- Add sanity tests for release 1.2 version (#11738)
- Add QA test case for `trust-remote-code` on multi-node failure (#11905)
- Fix `model_name` Starcoder 15B allowed-models issue (#11981)
- Upgrade `xgrammar` from 0.1.25 to 0.1.32 (#12016)
- Limit TileIRAS to CUDA 13.1 (#12042)
- Remove VisualGen benchmark test from YAML (#12027)
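The on-device suffix automaton for speculative decoding listed above (#11434) drafts tokens by matching the longest recent suffix of the generated sequence against an earlier occurrence and copying what followed it. A simplified CPU analog using plain n-gram search (not the device kernel, and not the actual automaton data structure) looks like:

```python
def propose_draft(tokens, max_ngram=3, num_draft=4):
    """Propose draft tokens by finding a previous occurrence of the
    longest trailing n-gram and copying the tokens that followed it.

    Simplified CPU sketch of suffix-match-based drafting."""
    for n in range(max_ngram, 0, -1):
        tail = tokens[-n:]
        # Scan backwards for the most recent earlier occurrence of the tail.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == tail:
                return tokens[i + n:i + n + num_draft]
    return []  # no match: propose nothing, fall back to normal decoding
```

The drafted tokens are then verified in a single target-model forward pass, as in other speculative decoding schemes.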
What's Changed
- [None][feat] Support tensor parallelism for nemotron-h model by @Wanli-Jiang in #11470
- [None][test] Add tests for all database configs. by @fsaady in #11653
- [https://nvbugs/5911143][fix] add async worker to MTP/Eagle3 sampler,… by @dhansen-nvidia in #11573
- [TRTLLM-10886][feat] Support PARD(Parallel Draft Model) in one-model spec dec by @ziyixiong-nv in #11438
- [None][fix] Fix disagg cancellation by @Tabrizian in #11730
- [None][fix] Use prefer_pinned() in pard.py by @mikeiovine in #11762
- [None][fix] Make KVCacheManagerV2 release mem immediately on shutdown by @lowsfer in #11746
- [TRTLLM-11115][feat] enable autotuner for visual gen + Compilation Config by @NVShreyas in #11660
- [None][chore] Minor fix in w4a8 mxfp4 mxfp8 test. by @Tracin in #11745
- [None][infra] Move B200 test stage to AIHub by @yuanjingx87 in #11692
- [None][infra] Waive failed cases for main on 02/27 by @EmmaQiaoCh in #11770
- [TRTLLM-11064][fix] Remove duplicated MoE Computation with Helix CP+DP by @brb-nv in #11167
- [TRTLLM-10386][fix] torch.compile: register add+norm fallback pass in multi-GPU mode by @luyiyun1021 in #11739
- [None][feat] Support heterogeneous tokens_per_block by @lowsfer in #11751
- [None][chore] Remove closed bugs by @xinhe-nv in #11527
- [None][test] local wheel installation support and add gb300 cases demo by @fredricz-20070104 in #11742
- [None][feat] Refactor cache manager v2 to simplify new model support by @jiaganc in #11749
- [https://nvbugs/5879614][fix] Waive test_guided_decoding_with_eagle3 xgrammar in disaggregated serving by @ziyixiong-nv in #11773
- [https://nvbugs/5911788][test] Waive test_llm_partial_update_weights[Qwen3/Qwen3-8B] by @liji-nv in #11785
- [None][feat] add globaltimer-based timing backend for autotuner profi… by @dhansen-nvidia in #11657
- [https://nvbugs/5926823][fix] Propagate logprobs from prefill to decode in disagg by @brb-nv in #11727
- [TRTLLMINF-9][chore] Remove submodule pulls from TRT-LLM git checkouts by @dpitman-nvda in #11693
- [https://nvbugs/5685010][fix] Delete test_eagle3_output_repetition_4gpus flaky assertions. by @zheyuf in https://github.com/NVI...
v1.3.0rc5.post1
What's Changed
- [None][chore] bump version to 1.3.0rc5.post1 by @tburt-nv in #11788
- [None][fix] Cherry pick cancel fix by @pcastonguay in #11790
- [https://nvbugs/5926823][fix] Cherry-pick: Propagate logprobs from prefill to decode in disagg (#11727) by @pcastonguay in #11792
- [https://nvbugs/5934461][fix] Cherry-picks 11767 (logits support in disagg) by @pcastonguay in #11832
- [https://nvbugs/5935104][fix] Cherry-pick Fix overly aggressive capacity scheduler by @pcastonguay in #11834
- [https://nvbugs/5938603][fix] Cherry-pick Fix E/PD disagg chunked prefill bug (#11805) by @pcastonguay in #11847
- [https://nvbugs/5930934][fix] Cherry-pick fix NCCL OOM hang by @pcastonguay in #11916
Full Changelog: v1.3.0rc5...v1.3.0rc5.post1