[BUG] SIGSEGV in fused_attn_fwd when cu_seqlens_q != cu_seqlens_q_padded with FP8 blockwise #2892

@NoonePauseferg

Environment

  • TransformerEngine: 2.12.0
  • Megatron-Core: 0.16.0
  • PyTorch: 2.9.1
  • CUDA: 12.9
  • GPU: H100

Description

DotProductAttention with the thd (packed sequence) format crashes with SIGSEGV when cu_seqlens_q differs from cu_seqlens_q_padded and the FP8 blockwise recipe is used.

Per the TE documentation, cu_seqlens_q should contain actual token boundaries while cu_seqlens_q_padded contains the padded memory layout boundaries. This is needed when there is padding between sequences in a packed batch (e.g., for FP8 alignment).
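To make the relationship between the two boundary arrays concrete, here is a minimal sketch in plain Python. It assumes each sequence is padded up to a multiple of 16 tokens; the helper name `build_cu_seqlens` and the `align` parameter are illustrative and not part of TE.

```python
from itertools import accumulate


def build_cu_seqlens(real_seqlens, align=16):
    """Return (cu_seqlens, cu_seqlens_padded) as plain Python lists.

    Hypothetical helper: `align` is an assumed alignment in tokens.
    """
    # Round each real length up to the next multiple of `align`.
    padded_seqlens = [-(-n // align) * align for n in real_seqlens]
    # Cumulative sums with a leading 0 give the sequence boundaries.
    cu_seqlens = [0] + list(accumulate(real_seqlens))
    cu_seqlens_padded = [0] + list(accumulate(padded_seqlens))
    return cu_seqlens, cu_seqlens_padded


cu, cu_padded = build_cu_seqlens([100, 150, 120, 130])
print(cu)         # [0, 100, 250, 370, 500]
print(cu_padded)  # [0, 112, 272, 400, 544]
```

Sequence i then holds `cu[i + 1] - cu[i]` real tokens, stored starting at offset `cu_padded[i]` in the packed tensor.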

Expected Behavior

Attention should:

  1. Use cu_seqlens_q for attention masking (only attend to real tokens)
  2. Use cu_seqlens_q_padded for memory layout (tensor indexing)
  3. Return output tensor in padded layout (same shape as input Q)
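The three points above can be sketched as a slow, pure-PyTorch reference that is handy for checking a fused kernel's output. This is not TE code: the function name `reference_thd_attention` is hypothetical, the boundaries are passed as Python lists, and zero-filling the padding rows of the output is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F


def reference_thd_attention(q, k, v, cu_seqlens, cu_seqlens_padded):
    """Causal self-attention over a padded thd layout (reference only).

    q, k, v: [total_padded_tokens, num_heads, head_dim].
    cu_seqlens / cu_seqlens_padded: lists of cumulative lengths.
    """
    out = torch.zeros_like(q)  # padded layout, same shape as Q
    for i in range(len(cu_seqlens) - 1):
        start = cu_seqlens_padded[i]                # memory offset from padded boundaries
        length = cu_seqlens[i + 1] - cu_seqlens[i]  # real token count from actual boundaries
        sl = slice(start, start + length)
        # SDPA expects [batch, heads, tokens, head_dim].
        qi, ki, vi = (t[sl].transpose(0, 1).unsqueeze(0) for t in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=True)
        out[sl] = oi.squeeze(0).transpose(0, 1)
    return out


torch.manual_seed(0)
cu, cu_padded = [0, 5, 8], [0, 16, 32]
q, k, v = (torch.randn(cu_padded[-1], 2, 4) for _ in range(3))
out = reference_thd_attention(q, k, v, cu, cu_padded)
print(out.shape == q.shape)  # True
```

Only real tokens take part in attention, while all indexing goes through the padded offsets, so the result can be compared row by row against a padded-layout output.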

Actual Behavior

  • FP8 blockwise: SIGSEGV in tex.fused_attn_fwd C++ kernel
  • FP8 delayed: Output tensor has unpadded size instead of padded, causing downstream shape mismatches
  • BF16 (no FP8): Works correctly when cu_seqlens_q and cu_seqlens_q_padded differ (non-FP8 backends handle it)

Impact

This bug makes FP8 training with sequence packing unusable when FP8 alignment padding is needed. The VERL framework (volcengine/verl) adds FP8 alignment padding for TE compatibility but passes identical cu_seqlens_q and cu_seqlens_q_padded as a workaround. This causes padding tokens to be visible to attention, corrupting the model output (training perplexity goes from 3.7 to 3055).

Reproduction

import torch
import transformer_engine.pytorch as te
# Recipes live in transformer_engine.common.recipe; Float8BlockScaling is
# assumed here to be the FP8 blockwise recipe this report refers to.
from transformer_engine.common.recipe import Float8BlockScaling
from transformer_engine.pytorch.attention import DotProductAttention

# Setup
batch_size = 4
real_seqlens = [100, 150, 120, 130]  # actual sequence lengths
padded_seqlens = [112, 160, 128, 144]  # each padded up to a multiple of 16 tokens for FP8
total_padded = sum(padded_seqlens)
hidden_dim = 1536
num_heads = 32
head_dim = hidden_dim // num_heads

# Create cu_seqlens (cumulative boundaries with a leading 0)
cu_seqlens_q = torch.tensor([0] + torch.cumsum(torch.tensor(real_seqlens), 0).tolist(),
                             dtype=torch.int32, device='cuda')
cu_seqlens_q_padded = torch.tensor([0] + torch.cumsum(torch.tensor(padded_seqlens), 0).tolist(),
                                    dtype=torch.int32, device='cuda')

# Create Q, K, V in thd format (padded layout)
q = torch.randn(total_padded, num_heads, head_dim, device='cuda', dtype=torch.bfloat16)
k = torch.randn(total_padded, num_heads, head_dim, device='cuda', dtype=torch.bfloat16)
v = torch.randn(total_padded, num_heads, head_dim, device='cuda', dtype=torch.bfloat16)

# The third positional argument of DotProductAttention is num_gqa_groups,
# so the head dimension is passed once, by keyword, as kv_channels.
attn = DotProductAttention(num_attention_heads=num_heads, kv_channels=head_dim)

# This works (cu_seqlens_q == cu_seqlens_q_padded):
with te.fp8_autocast(enabled=True, fp8_recipe=Float8BlockScaling()):
    out = attn(q, k, v,
               qkv_format='thd',
               cu_seqlens_q=cu_seqlens_q_padded,  # same as padded
               cu_seqlens_kv=cu_seqlens_q_padded,
               cu_seqlens_q_padded=cu_seqlens_q_padded,
               cu_seqlens_kv_padded=cu_seqlens_q_padded,
               max_seqlen_q=max(padded_seqlens),
               max_seqlen_kv=max(padded_seqlens),
               attn_mask_type='causal')

# This CRASHES (cu_seqlens_q != cu_seqlens_q_padded):
with te.fp8_autocast(enabled=True, fp8_recipe=Float8BlockScaling()):
    out = attn(q, k, v,
               qkv_format='thd',
               cu_seqlens_q=cu_seqlens_q,  # actual boundaries
               cu_seqlens_kv=cu_seqlens_q,
               cu_seqlens_q_padded=cu_seqlens_q_padded,  # padded boundaries
               cu_seqlens_kv_padded=cu_seqlens_q_padded,
               max_seqlen_q=max(real_seqlens),
               max_seqlen_kv=max(real_seqlens),
               attn_mask_type='causal')
# SIGSEGV ^^^

Core Problem: cu_seqlens_q used for BOTH attention masking AND SP communication

When cu_seqlens_q != cu_seqlens_q_padded:

  • DotProductAttention: Works correctly in isolation (masks padding, returns padded-size output)
  • ColumnParallelLinear with sequence_parallel=True: Uses cu_seqlens_q for allgather sizing → allgathered tensor gets UNPADDED size → breaks downstream RoPE/Linear that expect padded size

This means TE uses cu_seqlens_q for two conflicting purposes:

  1. Attention masking (should use actual/unpadded boundaries)
  2. SP allgather/scatter (should use padded boundaries)

Proposed fix: TE should separate these two uses. SP communication should ALWAYS use cu_seqlens_q_padded for sizing, while attention masking should use cu_seqlens_q.
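A rough illustration of that separation in plain Python, not taken from TE or Megatron-Core: buffer sizing reads only the padded boundaries, while the mask of real tokens is derived from both arrays. The function names `sp_gather_tokens` and `valid_token_mask` are hypothetical.

```python
def sp_gather_tokens(cu_seqlens_padded):
    """Token count a sequence-parallel all-gather buffer should hold.

    Sized from the padded boundaries only, so it matches the memory layout.
    """
    return cu_seqlens_padded[-1]


def valid_token_mask(cu_seqlens, cu_seqlens_padded):
    """Boolean list over the padded layout: True for real tokens, False for padding."""
    mask = [False] * cu_seqlens_padded[-1]
    for i in range(len(cu_seqlens) - 1):
        start = cu_seqlens_padded[i]                # where sequence i starts in memory
        length = cu_seqlens[i + 1] - cu_seqlens[i]  # how many of its tokens are real
        mask[start:start + length] = [True] * length
    return mask


cu = [0, 100, 250, 370, 500]
cu_padded = [0, 112, 272, 400, 544]
print(sp_gather_tokens(cu_padded))        # 544
print(sum(valid_token_mask(cu, cu_padded)))  # 500
```

With the lengths from the reproduction, communication would move 544 tokens per rank-local batch while attention would see only the 500 real ones.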

Impact on Training

With VERL framework training DeepSeek 10B MoE with Megatron-Core:

  • BF16 baseline: grad_norm=0.28, training_ppl=3.7
  • FP8 E2E (cu_seqlens_q == cu_seqlens_q_padded): grad_norm=130-500, training_ppl=3055 (garbage)
  • BF16 + FP8 padding only (no FP8 compute): grad_norm=1064, training_ppl=6.6 (indicates that the padding, not the FP8 compute, is the cause)

Related Issues

repro_fp8_padding_bug.py
