FP8BlockQuantizer does not work with TE #2393

@fy1214

Description

Describe the bug

  1. Issues Caused by Enabling Tensor Parallelism (TP)
    Symptom: when fp8_recipe == "blockwise" is enabled together with TP:
  • Error: AssertionError: All-gather requires quantizable tensor for quantizer Float8BlockQuantizer
  • If this assertion is commented out and SP (Sequence Parallelism) is disabled:
    Error: ValueError: When using expert parallelism and tensor parallelism, sequence parallelism must be used
  • If this assertion is commented out and SP is enabled:
    Error: CUDA Error: an illegal memory access was encountered
    Temporary Solution: disable TP for MoE training. (A minimal repro sketch follows after this list.)
  2. Issues Caused by Enabling CPU Offload
    If optimizer_cpu_offload is enabled, it requires:
  • not enabling --fp8-param-gather, and
  • args.fp8_recipe == "delayed"
    Problem: if --fp8-param-gather is not enabled, FP8 training ends up requiring more GPU memory than BF16 training (see the memory estimate after this list).
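
For issue 1, here is a minimal standalone sketch of the kind of setup that exercises the Float8BlockQuantizer all-gather path under TP + SP. It is an assumption-based repro, not the actual training script: the recipe class name `Float8BlockScaling`, the single TP group spanning all ranks, and the layer/sequence sizes are placeholders, and it may not trigger exactly the same failure as the full MoE run.

```python
# Hypothetical repro sketch (assumed recipe name, sizes, and TP layout).
# Launch with e.g.: torchrun --nproc_per_node=2 repro_blockwise_tp.py
import os

import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8BlockScaling  # blockwise recipe (name assumed)

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_size = dist.get_world_size()
tp_group = dist.new_group(ranks=list(range(tp_size)))

# Column-parallel layer with sequence parallelism, mirroring the TP+SP setup
# in which the fp8 all-gather of the input is performed.
layer = te.Linear(
    4096,
    4096,
    tp_group=tp_group,
    tp_size=tp_size,
    parallel_mode="column",
    sequence_parallel=True,
    params_dtype=torch.bfloat16,
).cuda()

recipe = Float8BlockScaling()  # corresponds to fp8_recipe == "blockwise"
x = torch.randn(2048 // tp_size, 4096, dtype=torch.bfloat16, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe, fp8_group=tp_group):
    y = layer(x)
y.float().sum().backward()

dist.destroy_process_group()
```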
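For issue 2, a back-of-envelope estimate illustrates one plausible accounting of why FP8 without --fp8-param-gather can need more GPU memory than BF16: the module keeps its BF16 weights plus FP8 working copies (and, depending on the kernel path, an FP8 transpose and per-block scales) for the GEMMs. The byte counts below are rough assumptions, not measurements.

```python
# Back-of-envelope per-parameter GPU memory (rough assumptions, not measured):
# without --fp8-param-gather the BF16 weights stay resident *and* FP8 copies
# are kept for the FP8 GEMMs, so weight memory per parameter grows.

def bytes_per_param(bf16_weight=True, fp8_copy=False, fp8_transpose=False, block=128):
    total = 0.0
    if bf16_weight:
        total += 2.0              # BF16 weight buffer
    if fp8_copy:
        total += 1.0              # FP8 data copy used by the FP8 GEMM
        total += 4.0 / block      # ~one fp32 scale per 1x128 block (approx.)
    if fp8_transpose:
        total += 1.0              # FP8 transpose copy (kernel-path dependent)
    return total

print("BF16 baseline:              ", bytes_per_param())                    # ~2.0  B/param
print("FP8, no --fp8-param-gather: ", bytes_per_param(fp8_copy=True,
                                                      fp8_transpose=True))  # ~4.03 B/param
```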

Environment details

If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • PyTorch version: 2.8.0+cu129
  • Python version: 3.12
  • Transformer Engine version: 2.8.0
  • CUDA version: 12.9
