
Fix LLaVA packed sequence cu_seqlens after image token expansion#3427

Closed
aroshanghias-nvd wants to merge 3 commits into NVIDIA:main from aroshanghias-nvd:fix/llava-packed-seq-cu-seqlens

Conversation

@aroshanghias-nvd aroshanghias-nvd commented Feb 14, 2026

Summary

When sequence packing is used with LLaVA, PackedSeqParams.cu_seqlens boundaries arrive in collapsed space (1 token per image), but _preprocess_data expands each image placeholder into img_seq_len embedding tokens. The cu_seqlens were never updated to reflect this expansion, causing attention kernels to see incorrect sub-sequence boundaries — corrupting attention masking, RoPE positions, and CP/SP sharding for every packed sub-sequence.
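For concreteness, a toy illustration of the mismatch (the numbers here are hypothetical and chosen for the example, not taken from the MCore code):

```python
# Hypothetical illustration: two packed sub-sequences of 10 collapsed tokens
# each, with one image placeholder per sub-sequence.
collapsed_cu_seqlens = [0, 10, 20]  # boundaries in collapsed space (1 token/image)
img_seq_len = 576                   # e.g. a 24x24-patch ViT image embedding

# After expansion, every sub-sequence grows by (img_seq_len - 1) per image
# it contains, so the boundaries the attention kernel should see are:
expanded_cu_seqlens = [0]
for length, n_images in [(10, 1), (10, 1)]:
    expanded_cu_seqlens.append(
        expanded_cu_seqlens[-1] + length + n_images * (img_seq_len - 1)
    )
print(expanded_cu_seqlens)  # [0, 585, 1170] -- not the stale [0, 10, 20]
```

Feeding the stale `[0, 10, 20]` to the attention kernel makes it treat most of the 1,170 expanded tokens as belonging to no sub-sequence at all.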

Changes

  • Add _remap_packed_seq_params_after_image_expansion() to LLaVAModel which translates all cu_seqlens entries from collapsed to expanded space using the same cumulative-sum position mapping as _preprocess_data.
  • Call it in forward() right after _preprocess_data and before the language model.
  • Add test_forward_packed_vs_batched parameterized over num_samples={1, 4, 16} that compares batched-unpacked vs packed forward passes using token_mult_prob_error (the same metric as NeMo-RL).
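A minimal sketch of the remapping idea (the function name echoes the PR, but the signature and internals are assumptions about how the cumulative-sum position mapping could look, not the actual MCore implementation):

```python
import torch

def remap_cu_seqlens_after_image_expansion(
    cu_seqlens: torch.Tensor,
    input_ids: torch.Tensor,
    image_token_index: int,
    img_seq_len: int,
) -> torch.Tensor:
    """Translate cu_seqlens from collapsed (1 token per image) to expanded space.

    Sketch only: argument names and the exact convention used by
    _preprocess_data are assumptions.
    """
    # Per-token expanded length: text tokens stay 1, image placeholders
    # become img_seq_len embedding tokens.
    lengths = torch.where(
        input_ids == image_token_index,
        torch.full_like(input_ids, img_seq_len),
        torch.ones_like(input_ids),
    )
    # Expanded start position of every collapsed position; one extra entry so
    # the final boundary (== collapsed sequence length) also maps correctly.
    starts = torch.cat([lengths.new_zeros(1), torch.cumsum(lengths, dim=0)])
    # Each cu_seqlens entry is an index into collapsed space, so a gather
    # translates all boundaries (and the total length) in one shot.
    return starts[cu_seqlens]
```

For example, with `input_ids = [7, IMG, 7, IMG]`, `cu_seqlens = [0, 2, 4]`, and `img_seq_len = 576`, the boundaries become `[0, 577, 1154]`.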

Test results

| num_samples | num_images | total_expanded | Without fix | With fix |
|------------:|-----------:|---------------:|------------:|---------:|
| 1           | 1          | 597            | 1.1032      | 1.000000 |
| 4           | 3          | 1,889          | 1.1175      | 1.000000 |
| 16          | 5          | 3,565          | 1.1319      | 1.000000 |

Test uses a real LLaVAModel (CLIP vision encoder + MLP projection + 3-layer GPT) on GPU with TE attention kernels — no mocking.

Test plan

  • New test test_forward_packed_vs_batched fails without the fix (token_mult_prob_error >> 1.02)
  • New test passes with the fix (token_mult_prob_error = 1.000000)
  • Existing test_forward, test_preprocess_data still pass
  • CI passes


Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@aroshanghias-nvd aroshanghias-nvd requested review from a team as code owners February 14, 2026 00:43

copy-pr-bot Bot commented Feb 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@ko3n1g ko3n1g requested a review from a team February 14, 2026 00:43
Ali Roshan Ghias added 2 commits February 13, 2026 17:14
Remove unnecessary intermediate variable _image_token_index and pass the expression directly to _preprocess_data for cleaner code.

Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
…overage

The cleanup commit c836295 removed the _image_token_index local variable
but forgot to update its usage in the _remap_packed_seq_params call at line
1002, causing a NameError on any packed-sequence forward pass with images.

Restore the local variable so both _preprocess_data and _remap use it.

Also fix test_forward's cu_seqlens which exceeded the collapsed input length
(previously unreachable), and enhance test_forward_packed_vs_batched with
variable tiles per image ([1,2,1,3,1] cycle) and multi-image samples.

Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
@aroshanghias-nvd (Author)

Let's drop this idea for now. It turns out that Megatron-LM already handles CP and sequence packing correctly, because it uses Energon, and Energon does image token expansion at data-preparation time. So if we update MCore, it will break MLM SFT!
