Fix ZeRO-3: Use per-param dtype for output buffers in _allgather_params_coalesced #8073

Open
albertvillanova wants to merge 1 commit into deepspeedai:master from albertvillanova:fix-8072
albertvillanova commented Jun 17, 2026

This PR fixes the _allgather_params_coalesced method in partition_parameters.py. The change ensures that each flat_tensor is created with the correct data type by referencing the corresponding parameter in param_list, rather than always using the first parameter's data type.

Fix #8072.

Problem

_allgather_params_coalesced allocates all output buffers using the dtype of the first parameter in param_list:

# before
for psize in partition_sizes:
    flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype, ...)

This assumed every persistent parameter shares the same dtype. The assumption was incidentally maintained before 0.19.2 because _configure_distributed_model called module.bfloat16() unconditionally, normalising all persistent parameters (including PEFT LoRA adapters) to a uniform dtype.

PR #8066 "Mixed-precision: per-policy param/buffer dtype cast (preserve fp32 buffers)" (commit b919284) correctly stopped casting ZeRO-Init model params, but in doing so exposed the latent bug: PEFT's default autocast_adapter_dtype=True keeps LoRA adapters in fp32 even when the base model is bf16. persistent_parameters therefore ends up with mixed dtypes (bf16 base-model params + fp32 LoRA params), and the mismatch between a bf16 output buffer and an fp32 input tensor raises:

TypeError: output tensor must have the same type as input tensor
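
The failure mode can be modelled without a distributed setup. The sketch below is a torch-free toy, not DeepSpeed code: `FakeTensor`, `fake_all_gather`, and `gather_with_first_dtype` are hypothetical stand-ins, the dtype strings and sizes are illustrative, and the stub only mimics the dtype check that produces the error above. Sizing each buffer as shard size times world size is also an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: only dtype and size matter here."""
    dtype: str
    numel: int


def fake_all_gather(output: FakeTensor, shard: FakeTensor) -> None:
    """Stub collective that mimics only the dtype check of a real all-gather."""
    if output.dtype != shard.dtype:
        raise TypeError("output tensor must have the same type as input tensor")


def gather_with_first_dtype(shards, world_size):
    """Model of the pre-fix behaviour: every buffer uses shards[0].dtype.

    Returns the number of shards gathered without error.
    """
    gathered = 0
    for shard in shards:
        # Assumed sizing: one shard-sized slot per rank.
        buffer = FakeTensor(dtype=shards[0].dtype, numel=shard.numel * world_size)
        fake_all_gather(buffer, shard)
        gathered += 1
    return gathered


# Uniform dtypes: the shared-dtype assumption holds and nothing fails.
uniform = [FakeTensor("bf16", 8), FakeTensor("bf16", 4)]
uniform_count = gather_with_first_dtype(uniform, world_size=2)

# bf16 base-model shards followed by an fp32 LoRA shard: the third buffer
# is allocated as bf16 and the stub rejects it.
mixed = [FakeTensor("bf16", 8), FakeTensor("bf16", 4), FakeTensor("fp32", 2)]
try:
    gather_with_first_dtype(mixed, world_size=2)
    mixed_error = None
except TypeError as exc:
    mixed_error = str(exc)
```

The uniform list gathers cleanly, which is consistent with the bug staying hidden while all persistent parameters were cast to one dtype.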

Solution

Allocate each output buffer with the dtype of its own parameter:

# after
for i, psize in enumerate(partition_sizes):
    flat_tensor = torch.empty(tensor_size, dtype=param_list[i].ds_tensor.dtype, ...)

This removes the shared-dtype assumption at the source rather than relying on upstream callers to normalise dtypes before calling _allgather_params_coalesced.
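
As a rough illustration of the per-parameter allocation, here is a torch-free sketch that mirrors the shape of the loop above. `Buffer` and `allocate_buffers` are hypothetical names, dtypes are plain strings rather than `torch.dtype` values, and the buffer size of `psize * world_size` is an assumption; the real method uses `torch.empty` and `param_list[i].ds_tensor.dtype`.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Hypothetical stand-in for an output buffer: a dtype plus an element count."""
    dtype: str
    numel: int


def allocate_buffers(param_dtypes, partition_sizes, world_size):
    """Allocate one output buffer per parameter, each with that parameter's dtype."""
    buffers = []
    for i, psize in enumerate(partition_sizes):
        # Index the dtype by i instead of always reading param_dtypes[0].
        buffers.append(Buffer(dtype=param_dtypes[i], numel=psize * world_size))
    return buffers


# Two bf16 base-model shards and one fp32 LoRA shard (illustrative values).
param_dtypes = ["bf16", "bf16", "fp32"]
partition_sizes = [8, 4, 2]

buffers = allocate_buffers(param_dtypes, partition_sizes, world_size=2)

# Every buffer matches its own input dtype, so a dtype check in the
# collective would pass for each parameter.
buffer_dtypes = [b.dtype for b in buffers]
buffer_sizes = [b.numel for b in buffers]
dtypes_match = buffer_dtypes == param_dtypes
```

Because the dtype is read per index, the helper needs no knowledge of whether callers normalised dtypes beforehand.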

Changes

  • Corrected tensor data type selection in _allgather_params_coalesced to use the data type of each parameter in param_list, ensuring proper handling of mixed data types.

Signed-off-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
albertvillanova changed the title from "Fix zero3: Use per-param dtype for output buffers in _allgather_params_coalesced" to "Fix ZeRO-3: Use per-param dtype for output buffers in _allgather_params_coalesced" on Jun 17, 2026
Development

Successfully merging this pull request may close these issues.

[BUG] ZeRO-3 + PEFT (LoRA) regression in 0.19.2: TypeError: output tensor must have the same type as input tensor in _allgather_params_coalesced