[Bug]: Support parameter sharding across safetensors #11541

@tcherckez-nvidia

Description

System Info

NA

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Use an HF model whose parameters are sharded across multiple safetensors files, for example nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 (see the sketch below for how to check which shard each parameter lives in).
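
A minimal sketch for inspecting the shard layout of such a checkpoint, assuming the repo ships a standard model.safetensors.index.json (the HF sharding convention); the layers.0 filter is just for illustration. This makes it easy to confirm that a single linear op's tensors (e.g. weight and its quantization scale) can land in different shard files.

```python
# Sketch: show which safetensors shard each parameter lives in.
import json

from huggingface_hub import hf_hub_download

# Download only the shard index, not the weights themselves.
index_path = hf_hub_download(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    "model.safetensors.index.json",
)
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]  # param name -> shard file name

# Print the shard assignment for one layer; tensors belonging to the same
# linear op may map to different shard files.
for name, shard in sorted(weight_map.items()):
    if "layers.0" in name:
        print(f"{name} -> {shard}")
```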

Expected behavior

Linear op parameters should be initialized correctly.

Actual behavior

Linear op parameters aren't initialized correctly.

Additional notes

When AD handles quantized weights here, it expects all of a linear op's parameters to be available in the same state_dict. When those parameters are sharded across safetensors files, each load call only sees a partial state_dict, so the condition is never triggered and the parameters for that linear op aren't initialized.

The load_hook needs to be made robust to these kinds of scenarios.
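
A minimal sketch of one way to do this, assuming PyTorch's load_state_dict pre-hook calling convention (module, state_dict, prefix, ...): buffer the tensors seen so far and only initialize once every required parameter has arrived, possibly across several shards. The names here (ShardTolerantLoadHook, REQUIRED_SUFFIXES, _pending) are hypothetical and not AD's actual implementation.

```python
# Sketch: a load hook that tolerates a linear op's parameters arriving
# in separate (sharded) state_dict calls.
import torch

# Parameters the quantized linear op needs before it can be initialized
# (illustrative; the real set depends on the quantization format).
REQUIRED_SUFFIXES = ("weight", "weight_scale")


class ShardTolerantLoadHook:
    def __init__(self):
        # Tensors seen so far, keyed by full parameter name; persists
        # across multiple partial state_dict loads.
        self._pending: dict[str, torch.Tensor] = {}

    def __call__(self, module, state_dict, prefix, *args):
        # Collect whichever of this module's params appear in this shard.
        for suffix in REQUIRED_SUFFIXES:
            key = prefix + suffix
            if key in state_dict:
                self._pending[key] = state_dict[key]
        # Initialize only once *all* required params have been seen,
        # instead of requiring them all in a single state_dict.
        if all(prefix + s in self._pending for s in REQUIRED_SUFFIXES):
            weight = self._pending.pop(prefix + "weight")
            scale = self._pending.pop(prefix + "weight_scale")
            # ... initialize the quantized linear op from weight + scale ...
```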

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

  • Disaggregated serving<NV>
  • bug
