
Fix LLaVA packed sequence cu_seqlens after image token expansion#3427

Closed
aroshanghias-nvd wants to merge 3 commits into NVIDIA:main from aroshanghias-nvd:fix/llava-packed-seq-cu-seqlens

Conversation

@aroshanghias-nvd aroshanghias-nvd commented Feb 14, 2026

Summary

When sequence packing is used with LLaVA, PackedSeqParams.cu_seqlens boundaries arrive in collapsed space (1 token per image), but _preprocess_data expands each image placeholder into img_seq_len embedding tokens. The cu_seqlens were never updated to reflect this expansion, causing attention kernels to see incorrect sub-sequence boundaries — corrupting attention masking, RoPE positions, and CP/SP sharding for every packed sub-sequence.
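For concreteness, a toy illustration of the mismatch (the numbers here are hypothetical and chosen for the example, not taken from the MCore code):

```python
# Hypothetical illustration: two packed sub-sequences of 10 collapsed tokens
# each, with one image placeholder per sub-sequence.
collapsed_cu_seqlens = [0, 10, 20]  # boundaries in collapsed space (1 token/image)
img_seq_len = 576                   # e.g. a 24x24-patch ViT image embedding

# After expansion, every sub-sequence grows by (img_seq_len - 1) per image
# it contains, so the boundaries the attention kernel should see are:
expanded_cu_seqlens = [0]
for length, n_images in [(10, 1), (10, 1)]:
    expanded_cu_seqlens.append(
        expanded_cu_seqlens[-1] + length + n_images * (img_seq_len - 1)
    )
print(expanded_cu_seqlens)  # [0, 585, 1170] -- not the stale [0, 10, 20]
```

Feeding the stale `[0, 10, 20]` to the attention kernel makes it treat most of the 1,170 expanded tokens as belonging to no sub-sequence at all.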

Changes

  • Add _remap_packed_seq_params_after_image_expansion() to LLaVAModel which translates all cu_seqlens entries from collapsed to expanded space using the same cumulative-sum position mapping as _preprocess_data.
  • Call it in forward() right after _preprocess_data and before the language model.
  • Add test_forward_packed_vs_batched parameterized over num_samples={1, 4, 16} that compares batched-unpacked vs packed forward passes using token_mult_prob_error (the same metric as NeMo-RL).
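A minimal sketch of the remapping idea (the function name echoes the PR, but the signature and internals are assumptions about how the cumulative-sum position mapping could look, not the actual MCore implementation):

```python
import torch

def remap_cu_seqlens_after_image_expansion(
    cu_seqlens: torch.Tensor,
    input_ids: torch.Tensor,
    image_token_index: int,
    img_seq_len: int,
) -> torch.Tensor:
    """Translate cu_seqlens from collapsed (1 token per image) to expanded space.

    Sketch only: argument names and the exact convention used by
    _preprocess_data are assumptions.
    """
    # Per-token expanded length: text tokens stay 1, image placeholders
    # become img_seq_len embedding tokens.
    lengths = torch.where(
        input_ids == image_token_index,
        torch.full_like(input_ids, img_seq_len),
        torch.ones_like(input_ids),
    )
    # Expanded start position of every collapsed position; one extra entry so
    # the final boundary (== collapsed sequence length) also maps correctly.
    starts = torch.cat([lengths.new_zeros(1), torch.cumsum(lengths, dim=0)])
    # Each cu_seqlens entry is an index into collapsed space, so a gather
    # translates all boundaries (and the total length) in one shot.
    return starts[cu_seqlens]
```

For example, with `input_ids = [7, IMG, 7, IMG]`, `cu_seqlens = [0, 2, 4]`, and `img_seq_len = 576`, the boundaries become `[0, 577, 1154]`.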

Test results

| num_samples | num_images | total_expanded | Without fix | With fix |
|------------:|-----------:|---------------:|------------:|---------:|
| 1           | 1          | 597            | 1.1032      | 1.000000 |
| 4           | 3          | 1,889          | 1.1175      | 1.000000 |
| 16          | 5          | 3,565          | 1.1319      | 1.000000 |

Test uses a real LLaVAModel (CLIP vision encoder + MLP projection + 3-layer GPT) on GPU with TE attention kernels — no mocking.

Test plan

  • New test test_forward_packed_vs_batched fails without the fix (token_mult_prob_error >> 1.02)
  • New test passes with the fix (token_mult_prob_error = 1.000000)
  • Existing test_forward, test_preprocess_data still pass
  • CI passes


Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@aroshanghias-nvd aroshanghias-nvd requested review from a team as code owners February 14, 2026 00:43

copy-pr-bot Bot commented Feb 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@ko3n1g ko3n1g requested a review from a team February 14, 2026 00:43
Ali Roshan Ghias added 2 commits February 13, 2026 17:14
Remove unnecessary intermediate variable _image_token_index and pass the expression directly to _preprocess_data for cleaner code.

Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
…overage

The cleanup commit c836295 removed the _image_token_index local variable
but forgot to update its usage in the _remap_packed_seq_params call at line
1002, causing a NameError on any packed-sequence forward pass with images.

Restore the local variable so both _preprocess_data and _remap use it.

Also fix test_forward's cu_seqlens which exceeded the collapsed input length
(previously unreachable), and enhance test_forward_packed_vs_batched with
variable tiles per image ([1,2,1,3,1] cycle) and multi-image samples.

Signed-off-by: Ali Roshan Ghias <aliroshanghias@nvidia.com>
@aroshanghias-nvd (Author)

Let's drop this idea for now. It turns out that Megatron-LM already handles CP and sequence packing correctly, because it uses Energon, and Energon does image token expansion at data-preparation time. So if we update MCore, it will break MLM SFT!
