Upgrade Dev containers for CICD to latest #891
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
📝 Walkthrough

CI workflows, test matrices, and installation docs updated for newer PyTorch/CUDA/tooling (container tags moved to 26.01, CUDA 13 variants added, Torch 2.10 added). Tests guarded by Megatron/TE checks were mostly removed or relaxed, and a new FSDP2 post-backward patch was added to the quantization trainer.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/gpu_tests.yml (1)
86-99: ⚠️ Potential issue | 🔴 Critical

`gpu-tests-non-pr` still references `cuda12` tox environments that were removed from `tox.ini`. The PR job matrix (lines 66-68) was updated to `cuda13`, but the non-PR job matrix (lines 92-94) still uses `py312-cuda12-gpu` and `py312-cuda12-gpu-megatron`. Since `tox.ini` no longer defines `cuda12` environments (replaced by `cuda13`), nightly and on-demand runs will fail with an unknown tox environment error.

🐛 Proposed fix

```diff
       matrix:
         include:
-          - example: py312-cuda12-gpu
+          - example: py312-cuda13-gpu
             timeout: 90
-          - example: py312-cuda12-gpu-megatron
+          - example: py312-cuda13-gpu-megatron
             timeout: 120
```

.github/workflows/unit_tests.yml (1)
79-86: ⚠️ Potential issue | 🔴 Critical

The `multi-torch` matrix includes `torch26`, which no longer exists in `tox.ini`. As noted in the `tox.ini` review, `torch26` was dropped from the tox env pattern. This matrix entry will produce `py312-torch26-tf_latest-unit`, which tox won't recognize. Either add `torch26` back to the tox env pattern or remove `26` from this matrix (and consider adding `210` if full coverage of the new version is desired).

🐛 Suggested fix (if torch26 is intentionally dropped)

```diff
       matrix:
-        torch: [26, 27, 28, 29]
+        torch: [27, 28, 29, 210]
```
🤖 Fix all issues with AI agents
In `@tox.ini`:
- Around lines 14-21: The tox env pattern
  `[testenv:{py310,py311,py312}-torch{27,28,29,210}-tf_{min,latest}-unit]` no longer
  includes torch26, but there is still a deps mapping `torch26: torchvision~=0.21.0`
  and the CI matrix (multi-torch) still lists 26. Fix by either (A) restoring 26 into
  the env factor (change `torch{27,28,29,210}` to `torch{26,27,28,29,210}`) so tox,
  the deps, and the workflow matrix align, or (B) removing the dead deps line
  `torch26: torchvision~=0.21.0` and removing the `26` entry from the multi-torch
  matrix in unit_tests.yml so the workflow no longer references an unsupported tox
  env. Ensure the tox env pattern, the deps block, and the workflow matrix stay
  consistent.
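A minimal sketch of option (A), showing only the one line that would change (the pattern is copied from the prompt above; the rest of `tox.ini` is assumed unchanged):

```diff
-[testenv:{py310,py311,py312}-torch{27,28,29,210}-tf_{min,latest}-unit]
+[testenv:{py310,py311,py312}-torch{26,27,28,29,210}-tf_{min,latest}-unit]
```

With that single change, the existing `torch26: torchvision~=0.21.0` deps line and the `26` entry in the `multi-torch` matrix both remain valid.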
🧹 Nitpick comments (1)
docs/source/getting_started/_installation_for_Linux.rst (1)
129-131: Consider splitting the long instruction for readability. The single bullet on line 131 packs multiple commands and conditional logic into one line, which can be hard to scan in rendered docs.
📝 Suggested restructure
```diff
-**CUDA specific dependencies**
-
-* By default, ``cupy-cuda12x`` is installed for INT4 ONNX quantization. If you have CUDA 13, you need to run ``pip uninstall -y cupy-cuda12x`` and ``pip install cupy-cuda13x`` after installing ``nvidia-modelopt[onnx]``.
+**CUDA specific dependencies**
+
+* By default, ``cupy-cuda12x`` is installed for INT4 ONNX quantization.
+* If you have CUDA 13, replace it after installing ``nvidia-modelopt[onnx]``:
+
+  .. code-block:: shell
+
+     pip uninstall -y cupy-cuda12x
+     pip install cupy-cuda13x
```
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
b7e3883 to f0bd99c
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #891      +/-   ##
==========================================
- Coverage   73.74%   73.54%   -0.21%
==========================================
  Files         199      205       +6
  Lines       21163    22000     +837
==========================================
+ Hits        15606    16179     +573
- Misses       5557     5821     +264
```

☔ View full report in Codecov by Sentry.
59eb8ff to 86e9821
@realAsma please help review this cursor-generated fix for passing the llm_qat test on PyTorch 2.10 (pytorch:26.01-py3 container).
Failing test without this fix: https://github.com/NVIDIA/Model-Optimizer/actions/runs/22007571666/job/63595325923?pr=891
[rank0]: RuntimeError: attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with grad_dtype 'Float'. The gradient must match the tensor's grad_dtype (defaults to the tensor's dtype). You can set the tensor's grad_dtype attribute with a specific dtype, or None to allow any dtype. Set grad_dtype with caution. Diverging the dtypes of a tensor and its gradient may break downstream systems that assume they match.
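For context, the fix under review (the FSDP2 post-backward patch discussed later in this thread) works around this by relaxing the new per-tensor `grad_dtype` check and casting the gradient back to the parameter dtype. A minimal, standalone sketch of that idea, independent of Model Optimizer and FSDP2, assuming the `grad_dtype` attribute quoted in the traceback exists only on torch >= 2.10:

```python
import torch

param = torch.nn.Parameter(torch.zeros(4, dtype=torch.float32))
bf16_grad = torch.ones(4, dtype=torch.bfloat16)

if hasattr(param, "grad_dtype"):
    # torch >= 2.10 (per the traceback above): relax the strict check, assign the
    # bf16 gradient, then cast back so the optimizer still sees an fp32 gradient.
    param.grad_dtype = None
    param.grad = bf16_grad
    param.grad = param.grad.to(param.dtype)
else:
    # Older torch already requires matching dtypes, so cast up front.
    param.grad = bf16_grad.to(param.dtype)

print(param.grad.dtype)  # torch.float32 either way
```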
This error does not seem to be related to Model Optimizer. We should probably raise this with PyTorch or Hugging Face. Can we add a note that this is a temporary fix and should be removed in the future?
Thanks Asma for the review. Can you please raise an issue with torch / transformers with the required details? I will add the issue link here for tracking.
```python
from packaging import version
...
def skip_if_no_tensorrt():
```
@gcunhase @ajrasane please help review this cursor-generated check for skipping tests/gpu/onnx/test_plugin.py on the CUDA 13 container with our current ORT-GPU installation.
Failure without this skip check: https://github.com/NVIDIA/Model-Optimizer/actions/runs/22007571673/job/63595332370?pr=891#step:7:364
```
*************** EP Error ***************
EP Error /onnxruntime_src/onnxruntime/python/onnxruntime_pybind_state.cc:505 void onnxruntime::python::RegisterTensorRTPluginsAsCustomOps(PySessionOptions&, const onnxruntime::ProviderOptions&) Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and that your GPU is supported.
when using [('TensorrtExecutionProvider', {'trt_max_workspace_size': 85899345920}), ('CUDAExecutionProvider', {'device_id': 0, 'arena_extend_strategy': 'kSameAsRequested'}), ('CPUExecutionProvider', {'arena_extend_strategy': 'kSameAsRequested'})]
Falling back to ['CPUExecutionProvider'] and retrying.
****************************************
----------------------------- Captured stderr call -----------------------------
2026-02-14 01:01:06.659874993 [E:onnxruntime:Default, provider_bridge_ort.cc:2167 TryGetProviderInfo_TensorRT] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1778 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_tensorrt.so with error: libcublas.so.12: cannot open shared object file: No such file or directory
```

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
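Regarding the skip check requested above: a hypothetical sketch of what such a guard could look like. The helper name comes from the diff context; everything else is an assumption. It only probes ONNX Runtime's provider list, so the actual check in this PR may go further (for example, verifying that the TensorRT provider library actually loads, since the log shows the provider is listed yet fails on a missing `libcublas.so.12`):

```python
import pytest


def skip_if_no_tensorrt():
    """Hypothetical sketch: skip when ONNX Runtime cannot offer the TensorRT EP."""
    ort = pytest.importorskip("onnxruntime")
    if "TensorrtExecutionProvider" not in ort.get_available_providers():
        pytest.skip("TensorrtExecutionProvider not available in this onnxruntime build")
```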
86e9821 to dea2eeb
@coderabbitai full review
✅ Actions performed: Full review triggered.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/_test_utils/import_helper.py (1)
100-102: ⚠️ Potential issue | 🟡 Minor

Pre-existing bug: skip message is inverted.
The condition on line 100 skips when the installed version is greater than 1.18, but the message says "is less than required". This is a pre-existing issue unrelated to this PR.
Proposed fix
```diff
     if version.parse(installed_version) > version.parse(required_version):
         pytest.skip(
-            f"{package_name} version {installed_version} is less than required {required_version}"
+            f"{package_name} version {installed_version} is greater than {required_version}"
         )
```
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/transformers_trainer.py (1)
387-387: Placement note: `_patch_fsdp2_post_backward()` is called unconditionally.

This patches the class-level method regardless of whether FSDP2 is actually in use. The `ImportError` guard and double-patch sentinel make this safe, but calling it only when `self.accelerator.is_fsdp2` would be more precise. Not a blocker since the guards handle it correctly.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/_test_utils/import_helper.py (1)
100-102: ⚠️ Potential issue | 🟡 Minor

Pre-existing bug: skip message says "less than" but condition checks "greater than".

Line 100 skips when `installed_version > required_version`, but the message on line 102 says the version "is less than required". This is a pre-existing issue unrelated to this PR but worth noting.

Proposed fix

```diff
     if version.parse(installed_version) > version.parse(required_version):
         pytest.skip(
-            f"{package_name} version {installed_version} is less than required {required_version}"
+            f"{package_name} version {installed_version} is above the maximum {required_version}"
         )
```
🧹 Nitpick comments (5)
docs/source/getting_started/_installation_for_Linux.rst (1)
129-131: Consider mentioning `[all]` in addition to `[onnx]` for the cupy swap instruction.

The cupy swap applies regardless of whether the user installed `nvidia-modelopt[onnx]` or `nvidia-modelopt[all]`. Mentioning only `[onnx]` may cause users who installed `[all]` to miss this step.

.github/workflows/example_tests.yml (1)
68-73: The explicit `docker_image: "26.01"` for `speculative_decoding` is currently redundant with the default.

Since the fallback `|| '26.01'` already covers it, the explicit value is unnecessary. However, keeping it is fine if the intent is to make it easy to override per-example in the future. The comment on line 59 hints at this being intentional.

tox.ini (2)
62-73: Consider pinning the `fast-hadamard-transform` dependency to a specific commit or tag.

`git+https://github.com/Dao-AILab/fast-hadamard-transform.git` installs from the default branch head, meaning CI can break at any time due to upstream changes. Pinning to a known-good commit (e.g., `...transform.git@<commit-sha>`) would make GPU test runs reproducible.

The `cupy-cuda12x` → `cupy-cuda13x` swap on lines 69–70 is consistent with the CUDA 13 migration.
75-83: Same unpinned-git concern applies to `mamba`; otherwise LGTM.

Line 79 installs `mamba` from the default branch of `state-spaces/mamba` without a commit pin (same reproducibility note as `fast-hadamard-transform` above). The rest of the Megatron GPU environment looks correct.

modelopt/torch/quantization/plugins/transformers_trainer.py (1)
104-147: Well-documented and correct implementation; optional refinements available.

The idempotency guard, import safety, and clear documentation of this as a temporary workaround are sound.
Two optional observations:
1. `grad_dtype` is cleared but never restored. After `_modelopt_original_post_backward()` runs, `grad_dtype` remains `None`. While it's re-set to `None` on every subsequent `post_backward` call, the codebase elsewhere uses save-restore patterns for similar monkey-patching (see `utils.py` lines 563–575). If future FSDP2 code paths inspect `grad_dtype` between backward passes, this could cause unexpected behavior. Consider saving and restoring it:

♻️ Optional: restore grad_dtype after the original call

```diff
 @torch.no_grad()
 def _patched_post_backward(self):
     # Allow bf16 gradients to be assigned to fp32 parameters
+    saved_grad_dtypes = {}
     for fsdp_param in self.fsdp_params:
         with contextlib.suppress(AttributeError):
+            saved_grad_dtypes[fsdp_param] = fsdp_param.sharded_param.grad_dtype
             fsdp_param.sharded_param.grad_dtype = None
     self._modelopt_original_post_backward()
     # Cast gradients to parameter dtype so the optimizer sees matching dtypes
     for fsdp_param in self.fsdp_params:
         sp = fsdp_param.sharded_param
         if sp.grad is not None and sp.grad.dtype != sp.dtype:
             sp.grad = sp.grad.to(sp.dtype)
+        if fsdp_param in saved_grad_dtypes:
+            sp.grad_dtype = saved_grad_dtypes[fsdp_param]
```
2. Exception safety. If `_modelopt_original_post_backward()` raises, the gradient casting loop (lines 142–145) is skipped. A `try`/`finally` would ensure cleanup (see the sketch below), though the risk is low for a temporary workaround.
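A sketch of the `try`/`finally` shape suggested in point 2, reusing the names from the diff above; illustrative only, not the code merged in this PR:

```python
import contextlib

import torch


@torch.no_grad()
def _patched_post_backward(self):
    # Allow bf16 gradients to be assigned to fp32 parameters.
    for fsdp_param in self.fsdp_params:
        with contextlib.suppress(AttributeError):
            fsdp_param.sharded_param.grad_dtype = None
    try:
        self._modelopt_original_post_backward()
    finally:
        # Runs even if the original post_backward raises, so the optimizer never
        # sees a gradient whose dtype differs from the parameter dtype.
        for fsdp_param in self.fsdp_params:
            sp = fsdp_param.sharded_param
            if sp.grad is not None and sp.grad.dtype != sp.dtype:
                sp.grad = sp.grad.to(sp.dtype)
```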
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
What does this PR do?
Testing
CI/CD in this PR should pass
Summary by CodeRabbit
New Features
Documentation
Chores