[nvbug 6289151] Fix exported Step layer type metadata#1693
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a sanitizer that trims Hugging Face ChangesLayer-types sanitization for HF export
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1693 +/- ##
==========================================
+ Coverage 67.72% 76.43% +8.71%
==========================================
Files 511 511
Lines 56168 56950 +782
==========================================
+ Hits 38037 43530 +5493
+ Misses 18131 13420 -4711
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
0bccd7b to
97687b3
Compare
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Small, focused bug fix (+121/-5, 4 files) for Step-3.5-Flash HF export where layer_types includes trailing MTP/next-token-prediction entries, breaking Transformers 5.x's len(layer_types) == num_hidden_layers validation in TRT-LLM.
Correctness:
sanitize_hf_config_for_deploymentis conservative: it only trims trailing entries when the mismatch is exactly explained bynum_nextn_predict_layers, and leaves unexplained mismatches untouched._as_nonnegative_intcorrectly excludesbool(which is anintsubclass) before the integer check._get_num_nextn_predict_layersfalls back config_data → model.config →_mtp_layer_prefixeslength; when it returnsNone,None == <int>isFalse, so no trim happens. Anum_nextn_predict_layers == 0value can never trigger a trim because the equal-length case already returns early. All edges consistent.- Integration point in
export_hf_checkpointis correct: called aftersave_pretrainedwritesconfig.jsonand before the config is re-written to disk; trimminglayer_types[:num_hidden_layers]keeps the head decoder entries, matching the documented MTP-entries-come-last assumption. - The
plugins/__init__.pyreorder (movinghf_checkpoint_utilsstar-import ahead ofvllm_fakequant_hf) is benign; both are simple plugin star-imports with no cross-dependency. Private helpers (_as_nonnegative_int,_get_num_nextn_predict_layers) won't leak viaimport *.
Tests: Three unit tests cover the config-nextn-count trim, the model.config-nextn-count trim, and the keep-unexplained-mismatch case, including warning emission. Meaningful coverage of the core branches.
No licensing changes, no new subsystem/abstraction, no prompt-injection attempts in the PR metadata. Full TRT-LLM serve wasn't run (checkpoint not visible from cluster), which is an acceptable limitation noted by the author for a config-only sanitizer with unit coverage.
Complex PR: 1 existing test file modified or removed. Looping in a human for approval.
97687b3 to
914cf1e
Compare
| from .model_utils import get_language_model_from_vl, is_multimodal_model | ||
| from .moe_utils import _export_fused_experts | ||
| from .plugins import SpeculativeDecodingExporter, has_spec_opt | ||
| from .plugins import SpeculativeDecodingExporter, has_spec_opt, sanitize_hf_config_for_deployment |
There was a problem hiding this comment.
Hard import of a symbol from an optionally-guarded plugin.
This now hard-imports sanitize_hf_config_for_deployment, which lives in hf_checkpoint_utils. In plugins/__init__.py that module is imported inside import_plugin(...) — a guard whose whole purpose is to tolerate the module failing to import. By contrast, SpeculativeDecodingExporter / has_spec_opt come from hf_spec_export, which is imported unguarded.
So if hf_checkpoint_utils ever fails under the guard, the symbol silently won't exist and this line raises ImportError, breaking all HF export — not just the Step path. In practice the deps (huggingface_hub, safetensors, tqdm) are always present for export, so the risk is low, but the import_plugin guard is partially defeated by taking a hard dependency on one of its symbols.
Note the import reorder in plugins/__init__.py doesn't close this gap — it only changes binding order; it doesn't make the symbol exist if the guarded import itself fails. Suggested fixes:
- guard this import defensively (
try/except ImportError→ no-op fallback), or - if
hf_checkpoint_utilsis genuinely required for export, drop theimport_pluginguard around it so failures are loud rather than silently amputating a symbol.
There was a problem hiding this comment.
Addressed in d83dcb8. I removed the import_plugin guard around hf_checkpoint_utils in plugins/init.py, so sanitize_hf_config_for_deployment is no longer a hard import from an optionally amputated plugin symbol. This matches the existing direct hf_checkpoint_utils imports in export code: failures are now loud instead of silently dropping the symbol. Validated with ruff-format, ruff-check, py_compile, and the focused hf_checkpoint_utils unit tests.
There was a problem hiding this comment.
Updated in 43a9abc. I changed the resolution from making hf_checkpoint_utils unguarded to preserving the import_plugin guard and adding a no-op fallback for sanitize_hf_config_for_deployment when the guarded import does not provide it. That keeps modelopt.torch.export import paths resilient to missing optional deps while keeping the direct unified_export_hf import stable. Validated with ruff-format, ruff-check, py_compile, and the focused hf_checkpoint_utils unit tests.
914cf1e to
d83dcb8
Compare
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of Step-3.5-Flash HF export layer_types fix (now +159/-5, 4 files).
Previous critical comment (Edwardf0t1): RESOLVED. 💬 Author replied: removed the import_plugin guard around hf_checkpoint_utils in plugins/__init__.py, so sanitize_hf_config_for_deployment is now an unguarded import matching the existing hf_spec_export pattern; failures are loud rather than silently amputating the symbol. Verified in the current tree — the guard is gone and the import is now plain from .hf_checkpoint_utils import *. This matches the reviewer's suggested fix exactly, and the hard import in unified_export_hf.py:93 is now safe.
Correctness: The sanitizer is conservative — it only trims trailing layer_types when the surplus is exactly num_nextn_predict_layers, otherwise leaves the config untouched. _as_nonnegative_int correctly excludes bool. _get_num_nextn_predict_layers falls back config_data → model.config → _mtp_layer_prefixes count; the new _count_mtp_layer_prefixes correctly excludes broad prefixes ("mtp" → not counted, "mtp.layers.0" → counted). Integration point (after save_pretrained, before re-writing config.json) is correct.
Tests: Five unit tests now cover config-count trim, model.config-count trim, MTP-prefix counting, broad-prefix-only no-op, and the keep-unexplained-mismatch case — meaningful branch coverage.
No licensing changes; no prompt-injection in PR metadata.
Why nudge rather than approve: the resolution looks correct but warrants a quick human confirmation, and the author notes the full TRT-LLM serve path was not run (Step-3.5 checkpoint not visible from the cluster login node), so the end-to-end deployment fix is unverified — only unit coverage exists for the config-only sanitizer.
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/plugins/__init__.py`:
- Line 23: Restore guarded import semantics: replace the unconditional "from
.hf_checkpoint_utils import *" with a guarded import that uses the package's
import_plugin mechanism (or a try/except ImportError around importing
hf_checkpoint_utils) so third‑party optional extras don't cause hard import
failures; specifically, use
import_plugin("modelopt.torch.export.plugins.hf_checkpoint_utils") or wrap the
import of hf_checkpoint_utils in try/except, emit a warning when the extra is
missing, and ensure any names expected from hf_checkpoint_utils are either
imported when available or left undefined/placeholder so package import paths
remain resilient.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 74637461-2fc7-43aa-83c9-862d6ca48c18
📒 Files selected for processing (4)
modelopt/torch/export/plugins/__init__.pymodelopt/torch/export/plugins/hf_checkpoint_utils.pymodelopt/torch/export/unified_export_hf.pytests/unit/torch/export/test_hf_checkpoint_utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
- modelopt/torch/export/plugins/hf_checkpoint_utils.py
- tests/unit/torch/export/test_hf_checkpoint_utils.py
- modelopt/torch/export/unified_export_hf.py
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
d83dcb8 to
43a9abc
Compare
What does this PR do?
Type of change: Bug fix
Fixes HF checkpoint export for Step-3.5-Flash-style configs where
layer_typesincludes main decoder layers plusnum_nextn_predict_layersentries. Transformers 5.x validates thatlen(layer_types) == num_hidden_layers, so TRT-LLM fails duringAutoConfig.from_pretrained()before loading the model.This PR trims only the trailing next-token-prediction/MTP
layer_typesentries when the mismatch is exactly explained bynum_nextn_predict_layers. It leaves unexplained mismatches unchanged.Usage
N/A
Testing
/Users/weimingc/miniconda3/envs/modelopt/bin/python -m py_compile modelopt/torch/export/unified_export_hf.py tests/unit/torch/export/test_unified_export_hf.py/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/export/test_unified_export_hf.py tests/unit/torch/export/test_nvfp4_utils.pygit diff --checkFull TRT-LLM serve was not run locally because the provided Step-3.5 checkpoint path is not visible from the configured cluster login node.
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: N/AAdditional Information
nvbug 6289151
Related MTP handling: #1532.
Summary by CodeRabbit
Bug Fixes
Tests
Chores