Describe the bug
- Issues Caused by Enabling Tensor Parallelism (TP)
  Phenomenon: when `fp8_recipe == blockwise` is used together with TP:
  - Error: `AssertionError: All-gather requires quantizable tensor for quantizer Float8BlockQuantizer`
  - If this assertion is commented out and SP (Sequence Parallelism) is disabled:
    `ValueError: When using expert parallelism and tensor parallelism, sequence parallelism must be used`
  - If this assertion is commented out and SP is enabled:
    `CUDA Error: an illegal memory access was encountered`
  Temporary solution: disable TP for MoE training (see the sketch after this list).
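To make the failure matrix above easier to scan, here is a minimal Python sketch that only encodes the outcomes reported in this issue; the function and parameter names are hypothetical and are not part of Megatron-LM or Transformer Engine:

```python
# Hypothetical helper that encodes only the outcomes reported in this issue;
# none of these names exist in Megatron-LM or Transformer Engine.
def fp8_blockwise_moe_tp_outcome(tp_size: int,
                                 sequence_parallel: bool,
                                 assertion_patched_out: bool) -> str:
    """Return the observed result for fp8_recipe == "blockwise" MoE training."""
    if tp_size == 1:
        # Current workaround: keep TP disabled for MoE training.
        return "trains"
    if not assertion_patched_out:
        return ("AssertionError: All-gather requires quantizable tensor "
                "for quantizer Float8BlockQuantizer")
    if not sequence_parallel:
        return ("ValueError: When using expert parallelism and tensor "
                "parallelism, sequence parallelism must be used")
    return "CUDA Error: an illegal memory access was encountered"


if __name__ == "__main__":
    # The configuration we would like to run: TP on, SP on, assertion intact.
    print(fp8_blockwise_moe_tp_outcome(tp_size=2,
                                       sequence_parallel=True,
                                       assertion_patched_out=False))
```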
- Issues Caused by Enabling CPU Offload
  If `optimizer_cpu_offload` is enabled, it requires:
  - not enabling `--fp8-param-gather`, and
  - `args.fp8_recipe == "delayed"`
  Problem: if `--fp8-param-gather` is not enabled, FP8 training ends up requiring more GPU memory than BF16 training (see the sketch below).
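For reference, a minimal sketch of the reported requirement written as an explicit check; the attribute names mirror the flags mentioned above (`optimizer_cpu_offload`, `--fp8-param-gather`, `fp8_recipe`) but are assumptions about the `args` namespace, not a verified Megatron-LM API:

```python
from types import SimpleNamespace


def check_optimizer_cpu_offload_fp8(args) -> None:
    """Sketch of the reported requirements when optimizer CPU offload is on."""
    if getattr(args, "optimizer_cpu_offload", False):
        # Reported constraint 1: --fp8-param-gather must stay disabled.
        assert not getattr(args, "fp8_param_gather", False), \
            "optimizer CPU offload currently requires --fp8-param-gather to be off"
        # Reported constraint 2: only the delayed FP8 recipe is accepted.
        assert getattr(args, "fp8_recipe", None) == "delayed", \
            "optimizer CPU offload currently requires fp8_recipe == 'delayed'"


if __name__ == "__main__":
    # Example config that satisfies the reported constraints but, without
    # --fp8-param-gather, uses more GPU memory for FP8 than for BF16 training.
    args = SimpleNamespace(optimizer_cpu_offload=True,
                           fp8_param_gather=False,
                           fp8_recipe="delayed")
    check_optimizer_cpu_offload_fp8(args)
```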
Environment details
If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:
- PyTorch version: 2.8.0+cu129
- Python version: 3.12
- Transformer Engine version: 2.8.0
- CUDA version: 12.9