FP8BlockQuantizer does not work with TE #2393

@fy1214

Description

Describe the bug

  1. Issues Caused by Enabling Tensor Parallelism (TP)
    Symptom: when fp8_recipe == "blockwise" is enabled together with TP:
  • Error: AssertionError: All-gather requires quantizable tensor for quantizer Float8BlockQuantizer
  • If this assertion is commented out and SP (Sequence Parallelism) is disabled:
    Error: ValueError: When using expert parallelism and tensor parallelism, sequence parallelism must be used
  • If this assertion is commented out and SP is enabled:
    Error: CUDA Error: an illegal memory access was encountered
    Temporary Solution: disable TP for MoE training. (A minimal repro sketch follows after this list.)
  2. Issues Caused by Enabling CPU Offload
    If optimizer_cpu_offload is enabled, it requires:
  • not enabling --fp8-param-gather, and
  • args.fp8_recipe == "delayed"
    Problem: if --fp8-param-gather is not enabled, FP8 training ends up requiring more GPU memory than BF16 training (see the memory estimate after this list).
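
For issue 1, here is a minimal standalone sketch of the kind of setup that exercises the Float8BlockQuantizer all-gather path under TP + SP. It is an assumption-based repro, not the actual training script: the recipe class name `Float8BlockScaling`, the single TP group spanning all ranks, and the layer/sequence sizes are placeholders, and it may not trigger exactly the same failure as the full MoE run.

```python
# Hypothetical repro sketch (assumed recipe name, sizes, and TP layout).
# Launch with e.g.: torchrun --nproc_per_node=2 repro_blockwise_tp.py
import os

import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8BlockScaling  # blockwise recipe (name assumed)

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_size = dist.get_world_size()
tp_group = dist.new_group(ranks=list(range(tp_size)))

# Column-parallel layer with sequence parallelism, mirroring the TP+SP setup
# in which the fp8 all-gather of the input is performed.
layer = te.Linear(
    4096,
    4096,
    tp_group=tp_group,
    tp_size=tp_size,
    parallel_mode="column",
    sequence_parallel=True,
    params_dtype=torch.bfloat16,
).cuda()

recipe = Float8BlockScaling()  # corresponds to fp8_recipe == "blockwise"
x = torch.randn(2048 // tp_size, 4096, dtype=torch.bfloat16, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe, fp8_group=tp_group):
    y = layer(x)
y.float().sum().backward()

dist.destroy_process_group()
```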
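For issue 2, a back-of-envelope estimate illustrates one plausible accounting of why FP8 without --fp8-param-gather can need more GPU memory than BF16: the module keeps its BF16 weights plus FP8 working copies (and, depending on the kernel path, an FP8 transpose and per-block scales) for the GEMMs. The byte counts below are rough assumptions, not measurements.

```python
# Back-of-envelope per-parameter GPU memory (rough assumptions, not measured):
# without --fp8-param-gather the BF16 weights stay resident *and* FP8 copies
# are kept for the FP8 GEMMs, so weight memory per parameter grows.

def bytes_per_param(bf16_weight=True, fp8_copy=False, fp8_transpose=False, block=128):
    total = 0.0
    if bf16_weight:
        total += 2.0              # BF16 weight buffer
    if fp8_copy:
        total += 1.0              # FP8 data copy used by the FP8 GEMM
        total += 4.0 / block      # ~one fp32 scale per 1x128 block (approx.)
    if fp8_transpose:
        total += 1.0              # FP8 transpose copy (kernel-path dependent)
    return total

print("BF16 baseline:              ", bytes_per_param())                    # ~2.0  B/param
print("FP8, no --fp8-param-gather: ", bytes_per_param(fp8_copy=True,
                                                      fp8_transpose=True))  # ~4.03 B/param
```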

Environment details

If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • PyTorch version: 2.8.0+cu129
  • Python version: 3.12
  • Transformer Engine version: 2.8.0
  • CUDA version: 12.9
