System Info
NA
Who can help?
No response
Information
Tasks
Reproduction
Load an HF model whose parameters are sharded across multiple safetensors files, for example nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4.
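To check whether a checkpoint is affected, a rough sketch like the following can list modules whose tensors live in more than one shard. It assumes the checkpoint ships the usual Hugging Face index file (typically model.safetensors.index.json) with a "weight_map" entry mapping tensor names to shard files. The helper names and the tensor names in the usage lines are made up for illustration and are not taken from this model.

```python
import json
from collections import defaultdict
from typing import Dict, Set


def load_weight_map(index_path: str) -> Dict[str, str]:
    """Read the tensor-name -> shard-file mapping from a safetensors index file."""
    with open(index_path) as f:
        return json.load(f)["weight_map"]


def modules_split_across_shards(weight_map: Dict[str, str]) -> Dict[str, Set[str]]:
    """Return modules whose tensors are stored in more than one shard file."""
    shards_by_module = defaultdict(set)
    for tensor_name, shard_file in weight_map.items():
        # "a.b.linear.weight" -> module "a.b.linear"
        module_name = tensor_name.rpartition(".")[0]
        shards_by_module[module_name].add(shard_file)
    return {m: s for m, s in shards_by_module.items() if len(s) > 1}


# Hypothetical weight map: "fc1" is split across two shards, "fc2" is not.
example_map = {
    "fc1.weight": "model-00001-of-00002.safetensors",
    "fc1.weight_scale": "model-00002-of-00002.safetensors",
    "fc2.weight": "model-00002-of-00002.safetensors",
    "fc2.weight_scale": "model-00002-of-00002.safetensors",
}
split_modules = modules_split_across_shards(example_map)
```

Any module reported here is a candidate for hitting the problem described in this issue.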
Expected behavior
Linear op params should be initialized correctly.
Actual behavior
Linear op params aren't initialized correctly.
Additional notes
When AD handles quantized weights, it expects all of a linear op's parameters to be available in the same state_dict. When those parameters are sharded across safetensors files, the condition is never triggered, and the params for that linear op aren't initialized.
We need to make the load_hook robust to these kinds of scenarios.
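One possible direction, sketched below in plain Python, is to have the hook buffer whatever parameters each shard provides and finalize the linear op only once the full set has been seen. This is not AD's actual load_hook: the class name, the finalize callback, and the parameter names (weight, weight_scale, input_scale) are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Iterable


class ShardTolerantLoader:
    """Buffers one linear op's params across shards (hypothetical helper)."""

    def __init__(
        self,
        prefix: str,
        required: Iterable[str],
        finalize: Callable[[Dict[str, Any]], None],
    ) -> None:
        self.prefix = prefix
        self.required = frozenset(required)
        self.finalize = finalize
        self.pending: Dict[str, Any] = {}
        self.done = False

    def load_hook(self, state_dict: Dict[str, Any]) -> None:
        """Called once per shard's state_dict; a shard may hold only some params."""
        for name in self.required:
            key = self.prefix + name
            if key in state_dict:
                self.pending[name] = state_dict[key]
        # Finalize exactly once, as soon as every required param has arrived.
        if not self.done and self.required.issubset(self.pending):
            self.finalize(dict(self.pending))
            self.done = True


# Usage with two hypothetical shards; values stand in for tensors.
initialized: Dict[str, Any] = {}
loader = ShardTolerantLoader(
    "layers.0.linear.", ["weight", "weight_scale", "input_scale"], initialized.update
)
loader.load_hook({"layers.0.linear.weight": 1})
ready_after_first_shard = loader.done
loader.load_hook({"layers.0.linear.weight_scale": 2, "layers.0.linear.input_scale": 3})
```

With this shape, the order in which shards arrive does not matter, and a linear op whose params all sit in one shard is still finalized on the first call.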