When using context parallelism to fine-tune gpt-oss, no attention backend is supported for this configuration. I have to change cp_comm_type from the default p2p to a2a to enable FusedAttention, but a2a is potentially less efficient than p2p at large context lengths (say 128k).
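For reference, a minimal sketch of the workaround, assuming a Megatron Core-style TransformerConfig that exposes context_parallel_size and cp_comm_type (field names and availability may differ across versions, and the model dims here are illustrative, not gpt-oss's actual ones):

```python
# Minimal sketch, assuming Megatron Core's TransformerConfig exposes
# context_parallel_size and cp_comm_type; verify against your version.
from megatron.core.transformer import TransformerConfig

config = TransformerConfig(
    num_layers=36,             # illustrative values only
    hidden_size=4096,
    num_attention_heads=32,
    context_parallel_size=8,   # shard the long sequence across 8 CP ranks
    cp_comm_type="a2a",        # all-to-all (Ulysses-style) instead of the
                               # default "p2p" ring; a2a redistributes over
                               # attention heads, so the head count must be
                               # divisible by context_parallel_size
)
```

As I understand the tradeoff: a2a gives each rank the full sequence for a subset of heads, while p2p keeps heads whole and rings KV chunks between ranks, overlapping communication with compute, which tends to scale better at very long contexts like 128k.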